# MOSS-TTS-Nano API Documentation

This document describes the REST API endpoints exposed by MOSS-TTS-Nano. The API is built on FastAPI; while the server is running, interactive OpenAPI (Swagger UI) documentation is available at `http://127.0.0.1:18083/docs`.

---

## 1. Synchronous Synthesis Endpoint

### `POST /api/generate`
Generates a complete synthesized audio file for a given text and reference prompt.

- **Request Type**: `multipart/form-data`
- **Parameters**:

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `text` | `string` | *Required* | The target text to be synthesized. |
| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
| `tts_max_batch_size` | `integer` | `0` | Batch size constraint for TTS generation (0 = auto). |
| `codec_max_batch_size` | `integer` | `0` | Batch size constraint for Codec decoding (0 = auto). |
| `enable_text_normalization` | `string` | `"1"` | Enables WeTextProcessing frontend (`"1"` or `"0"`). |
| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
| `cpu_threads` | `integer` | `0` | Number of CPU threads to use (`0` = use system settings). |
| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
| `seed` | `string` | `"0"` | Random seed. Set to `"0"` or an empty string for random (non-deterministic) generation. |

- **Response Example (`200 OK`)**:
```json
{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
```
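
As an illustration of calling this endpoint, the following is a minimal client sketch that uses only the Python standard library. The helper names (`encode_multipart`, `generate`, `decode_audio`) are hypothetical and not part of MOSS-TTS-Nano, and the base URL assumes the default local address mentioned above. The `UklGR` prefix in the example response is the base64 encoding of the `RIFF` magic bytes, which suggests that `audio_base64` holds a WAV file; treat that as an inference rather than a guarantee.

```python
import base64
import json
import urllib.request
import uuid

BASE_URL = "http://127.0.0.1:18083"  # assumed default local address


def encode_multipart(fields, files=None):
    """Encode form fields (and optional files) as multipart/form-data.

    fields: dict of name -> value (converted to text with str()).
    files:  dict of name -> (filename, bytes), e.g. for prompt_audio.
    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    for name, (filename, data) in (files or {}).items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def generate(text, demo_id="zh_1", base_url=BASE_URL, **extra_fields):
    """POST /api/generate and return the parsed JSON response."""
    body, content_type = encode_multipart(
        {"text": text, "demo_id": demo_id, **extra_fields}
    )
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def decode_audio(result):
    """Decode the audio_base64 field of a response into raw bytes."""
    return base64.b64decode(result["audio_base64"])
```

With a server running, `decode_audio(generate("Welcome to MOSS TTS.", demo_id="en_3"))` would return the audio bytes, which can then be written to a file. Additional parameters from the table, such as `seed="42"`, can be passed as keyword arguments.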

---

## 2. Asynchronous Streaming API

Streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio packets while the model is still generating.

### `POST /api/generate-stream/start`
Starts a background streaming generation job and registers a `stream_id`.

- **Request Type**: `multipart/form-data`
- **Parameters**: Same parameters as `/api/generate`.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}
```
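
The URLs in this response are relative paths, so a client needs to resolve them against the server address before using them. A minimal sketch, assuming the default local address; `resolve_stream_urls` is a hypothetical helper name:

```python
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:18083"  # assumed default local address


def resolve_stream_urls(start_response, base_url=BASE_URL):
    """Turn the relative *_url fields of a start response into absolute URLs."""
    return {
        key: urljoin(base_url, start_response[key])
        for key in ("audio_url", "status_url", "result_url")
    }
```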

### `GET /api/generate-stream/{stream_id}/status`
Polls the active streaming status, progress metrics, and elapsed times.

- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
```
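
One way a client might use this endpoint is a simple polling loop. The sketch below assumes that `ready` becomes `true` once the job has finished (as in the `/result` example) and that `failed` signals an unrecoverable error; the function names, polling interval, and timeout are illustrative choices, not values prescribed by the API.

```python
import json
import time
import urllib.request


def fetch_json(url):
    """GET a URL and parse the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_stream(status_url, fetch=fetch_json, interval=0.25, timeout=120.0):
    """Poll the status endpoint until the job is ready or has failed.

    Returns the final status dict. Raises RuntimeError if the job reports
    failure and TimeoutError if it is not ready within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(status_url)
        if status.get("failed"):
            raise RuntimeError(status.get("status_text", "stream failed"))
        if status.get("ready"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"stream not ready after {timeout} s")
        time.sleep(interval)
```

Passing the fetch function as a parameter keeps the loop testable without a running server.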

### `GET /api/generate-stream/{stream_id}/audio`
Streams raw audio packets from the generation queue.

- **Response Type**: Chunked binary stream (`application/octet-stream`)
- **Format**: Raw PCM 16-bit Little-Endian (`pcm_s16le`)
- **Headers**:
  - `X-Audio-Codec`: `pcm_s16le`
  - `X-Audio-Sample-Rate`: `48000`
  - `X-Audio-Channels`: `1`
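
Because the stream carries headerless PCM, most players cannot open the bytes directly. The following sketch wraps received `pcm_s16le` data in a WAV container using Python's standard `wave` module. The helper names are hypothetical, and the defaults mirror the header values listed above; in practice they should be read from the response headers.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # pcm_s16le: signed 16-bit samples


def pcm_to_wav(pcm, sample_rate=48000, channels=1):
    """Wrap raw pcm_s16le bytes in a WAV container and return the WAV bytes."""
    frame_size = BYTES_PER_SAMPLE * channels
    usable = len(pcm) - (len(pcm) % frame_size)  # drop a trailing partial frame
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        # WAV stores 16-bit samples little-endian, so no byte swapping is needed.
        wav.writeframes(pcm[:usable])
    return buffer.getvalue()


def duration_seconds(pcm, sample_rate=48000, channels=1):
    """Playback duration of a raw PCM buffer, counting whole frames only."""
    return (len(pcm) // (BYTES_PER_SAMPLE * channels)) / sample_rate
```

At 48,000 Hz mono, one second of audio is 96,000 bytes, which is a quick way to sanity-check how much audio a client has buffered.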

### `GET /api/generate-stream/{stream_id}/result`
Retrieves the finished synthesis payload (including fully-stitched base64-encoded audio) and chunk-to-timestamp range indexes once the job completes.

- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [
    [0.0, 4.5, 0]
  ],
  "audio_base64": "UklGRi..."
}
```
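
The layout of each `audio_chunk_ranges` entry is not spelled out here. The sketch below assumes each entry is `[start_seconds, end_seconds, chunk_index]`, with `chunk_index` referring to a position in `text_chunks`; this reading is a guess based on the example and should be checked against the server code. Under that assumption, the ranges can be converted into sample offsets within the decoded audio:

```python
def chunk_sample_ranges(audio_chunk_ranges, sample_rate=48000):
    """Convert [start_s, end_s, chunk_index] entries into sample offsets.

    Assumes (unverified) that the third element indexes into text_chunks.
    Returns a list of (chunk_index, start_sample, end_sample) tuples.
    """
    ranges = []
    for start_s, end_s, chunk_index in audio_chunk_ranges:
        ranges.append(
            (
                int(chunk_index),
                round(start_s * sample_rate),
                round(end_s * sample_rate),
            )
        )
    return ranges
```

For the example above, `chunk_sample_ranges([[0.0, 4.5, 0]])` maps chunk `0` to samples `0` through `216000` at 48,000 Hz.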

### `POST /api/generate-stream/{stream_id}/close`
Closes the active queue session, terminates background threads, and purges intermediate disk caches.
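
Since closing releases server-side resources, a client may want to guarantee the call even when playback fails midway. A minimal sketch using a context manager follows; `stream_session` and `post_empty` are hypothetical names, the base URL is the assumed default, and the response body of the close endpoint is not inspected because its shape is not documented here.

```python
import contextlib
import urllib.request

BASE_URL = "http://127.0.0.1:18083"  # assumed default local address


def post_empty(url):
    """Send a POST with an empty body and return the HTTP status code."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


@contextlib.contextmanager
def stream_session(start_response, base_url=BASE_URL, post=post_empty):
    """Yield the start response and always call the close endpoint on exit."""
    stream_id = start_response["stream_id"]
    close_url = f"{base_url}/api/generate-stream/{stream_id}/close"
    try:
        yield start_response
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        post(close_url)
```

A caller would wrap the streaming work in `with stream_session(start_response) as job:` so that the close request is sent even if reading the audio raises an exception.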

---

## 3. Metadata & Health Endpoints

### `GET /api/metadata`
Returns information regarding loaded tokenizer weights, hardware attention defaults, and runtime modules.

- **Response Example (`200 OK`)**:
```json
{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}
```

### `GET /api/warmup-status`
Returns model graph pre-warming status.

- **Response Example (`200 OK`)**:
```json
{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}
```

### `GET /api/text-normalization-status`
Returns Chinese/English front-end WeTextProcessing engine diagnostics.

---

## 4. Preset Voices

### `GET /api/demo-prompt-audio/{demo_id}`
Streams a reference preset voice file directly.

- **Parameters**: `demo_id` (e.g. `zh_1`, `en_3`).
- **Response**: Binary audio file stream (`audio/wav`).

---

## 5. Live Deployments

You can interact with a live deployment of this refactored, minimalist modular version here:
- **Hugging Face Space (Refactored)**: [Tom199328/MOSS-TTS-Nano](https://huggingface.co/spaces/Tom199328/MOSS-TTS-Nano)