MOSS-TTS-Nano API Documentation
This document describes the modularized REST API endpoints exposed by MOSS-TTS-Nano. The API is built on FastAPI; while the server is running, interactive OpenAPI (Swagger) docs are available at http://127.0.0.1:18083/docs.
1. Synchronous Synthesis Endpoint
POST /api/generate
Generates a complete synthesized audio file for a given text and reference prompt.
- Request Type: multipart/form-data
- Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | string | Required | The target text to be synthesized. |
| demo_id | string | "" | The identifier of a preset voice (e.g. zh_1, en_3). Required if prompt_audio is not provided. |
| prompt_audio | file | None | Reference audio file for zero-shot voice cloning. Overrides demo_id if uploaded. |
| max_new_frames | integer | 375 | Maximum number of new frames generated by the model. |
| voice_clone_max_text_tokens | integer | 75 | Maximum context length for text-based voice cloning. |
| tts_max_batch_size | integer | 0 | Batch size constraint for TTS generation (0 = auto). |
| codec_max_batch_size | integer | 0 | Batch size constraint for Codec decoding (0 = auto). |
| enable_text_normalization | string | "1" | Enables WeTextProcessing frontend ("1" or "0"). |
| enable_normalize_tts_text | string | "1" | Enables secondary robust text normalization ("1" or "0"). |
| cpu_threads | integer | 0 | Target CPU threads. 0 leverages system settings. |
| attn_implementation | string | "model_default" | Attention implementation (e.g. "model_default", "sdpa", "eager"). |
| do_sample | string | "1" | Whether to use sampling during decoding ("1" or "0"). |
| text_temperature | float | 1.0 | Sampling temperature for text tokens. |
| text_top_p | float | 1.0 | Top-p filtering threshold for text. |
| text_top_k | integer | 50 | Top-k filtering threshold for text. |
| audio_temperature | float | 0.8 | Sampling temperature for audio tokens. |
| audio_top_p | float | 0.95 | Top-p filtering threshold for audio. |
| audio_top_k | integer | 25 | Top-k filtering threshold for audio. |
| audio_repetition_penalty | float | 1.2 | Repetition penalty scaling factor for audio generation. |
| seed | string | "0" | Random seed (set to "0" or empty string for random/non-deterministic). |
- Response Example (200 OK):
{
"audio_base64": "UklGRi...",
"sample_rate": 48000,
"run_status": "Time: 1.2s | Speed: 25.4 frames/s",
"prompt_audio_path": "assets/audio/zh_1.wav",
"warmup_status_text": "Model Ready",
"text_chunks": ["Welcome to MOSS TTS."],
"normalized_text": "welcome to moss tts",
"normalization_method": "wetext",
"text_normalization_language": "en"
}
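A minimal client sketch for this endpoint, assuming the server runs at the address given above and that the third-party `requests` package is installed. The helper names (`build_form_fields`, `decode_audio`, `synthesize`) are illustrative and not part of the API. The `UklGRi...` prefix in the example is the base64 encoding of `RIFF`, so the decoded bytes are presumably a WAV file; treat that as an assumption rather than a guarantee.

```python
import base64

BASE_URL = "http://127.0.0.1:18083"  # assumed local server address


def build_form_fields(text, demo_id="", **overrides):
    """Build string form fields for /api/generate.

    Booleans become "1"/"0" because the flag parameters are documented as strings.
    """
    if not text:
        raise ValueError("text is required")
    fields = {"text": text, "demo_id": demo_id}
    for name, value in overrides.items():
        fields[name] = ("1" if value else "0") if isinstance(value, bool) else str(value)
    return fields


def decode_audio(payload):
    """Return the raw audio bytes stored in the response's audio_base64 field."""
    return base64.b64decode(payload["audio_base64"])


def synthesize(text, demo_id="", prompt_audio_path=None, **overrides):
    """POST to /api/generate and return (audio_bytes, sample_rate)."""
    import requests  # third-party; imported lazily so the helpers above need only the stdlib

    if not demo_id and prompt_audio_path is None:
        raise ValueError("provide demo_id or prompt_audio_path")
    # (None, value) tuples make requests encode plain fields as multipart/form-data.
    parts = {k: (None, v) for k, v in build_form_fields(text, demo_id, **overrides).items()}
    if prompt_audio_path is not None:
        with open(prompt_audio_path, "rb") as fh:
            parts["prompt_audio"] = ("prompt.wav", fh.read(), "audio/wav")
    response = requests.post(f"{BASE_URL}/api/generate", files=parts, timeout=300)
    response.raise_for_status()
    payload = response.json()
    return decode_audio(payload), payload["sample_rate"]
```

A call such as `synthesize("Welcome to MOSS TTS.", demo_id="en_3", do_sample=False)` would then return the audio bytes, which can be written to disk as-is if they are indeed a WAV file.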
2. Asynchronous Streaming API
Streaming endpoints are designed for ultra-low latency playback, delivering chunked PCM audio packets while the model is still executing.
POST /api/generate-stream/start
Starts a background streaming generation job and registers a stream_id.
- Request Type: multipart/form-data
- Parameters: Same parameters as /api/generate.
- Response Example (200 OK):
{
"stream_id": "job_a1b2c3d4",
"audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
"status_url": "/api/generate-stream/job_a1b2c3d4/status",
"result_url": "/api/generate-stream/job_a1b2c3d4/result",
"sample_rate": 48000,
"channels": 1,
"run_status": "Streaming realtime audio...",
"prompt_audio_path": "assets/audio/zh_1.wav",
"text_chunks": ["Welcome to the streaming engine."]
}
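The URLs in the start response are relative paths, so a client has to join them with the server address before use. A small sketch, assuming the local address from the introduction; `resolve_stream_urls` is a hypothetical helper name:

```python
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:18083"  # assumed local server address


def resolve_stream_urls(start_payload, base_url=BASE_URL):
    """Turn the relative *_url fields of a start response into absolute URLs."""
    return {
        key: urljoin(base_url, value)
        for key, value in start_payload.items()
        if key.endswith("_url")
    }


# Trimmed copy of the example response above.
start_payload = {
    "stream_id": "job_a1b2c3d4",
    "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
    "status_url": "/api/generate-stream/job_a1b2c3d4/status",
    "result_url": "/api/generate-stream/job_a1b2c3d4/result",
}
urls = resolve_stream_urls(start_payload)
```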
GET /api/generate-stream/{stream_id}/status
Polls the active streaming status, progress metrics, and elapsed times.
- Response Example (200 OK):
{
"stream_id": "job_a1b2c3d4",
"state": "running",
"ready": false,
"failed": false,
"emitted_audio_seconds": 2.45,
"lead_seconds": 1.28,
"first_audio_latency_seconds": 0.45,
"status_text": "Streaming...",
"stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
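A polling loop over this endpoint might look like the sketch below. It assumes that `ready` turns true once the job has finished and that `failed` signals an unrecoverable error; the exact state machine is not specified here. `wait_for_stream` is a hypothetical helper that takes any zero-argument callable returning the parsed status JSON, so it is independent of the HTTP library used.

```python
import time


def wait_for_stream(fetch_status, poll_interval=0.5, timeout=120.0, sleep=time.sleep):
    """Poll until the status reports ready or failed, then return the last status dict."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("failed"):
            # Assumption: a failed job never recovers, so stop polling.
            raise RuntimeError(f"stream failed: {status.get('status_text', '')}")
        if status.get("ready"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("stream did not finish in time")
        sleep(poll_interval)
```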
GET /api/generate-stream/{stream_id}/audio
Streams raw audio packets from the generation queue.
- Response Type: Chunked binary stream (application/octet-stream)
- Format: Raw PCM 16-bit Little-Endian (pcm_s16le)
- Headers:
  - X-Audio-Codec: pcm_s16le
  - X-Audio-Sample-Rate: 48000
  - X-Audio-Channels: 1
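Because the stream carries headerless PCM, a client that wants a playable file must add a container itself. The sketch below wraps received chunks in a WAV header using Python's standard `wave` module; the function names are illustrative, and the defaults mirror the header values listed above.

```python
import io
import wave


def audio_format_from_headers(headers):
    """Read (sample_rate, channels) from the X-Audio-* response headers."""
    return int(headers.get("X-Audio-Sample-Rate", 48000)), int(headers.get("X-Audio-Channels", 1))


def pcm_chunks_to_wav(chunks, sample_rate=48000, channels=1):
    """Wrap raw pcm_s16le chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Chunks may split a sample across boundaries; the bytes are
            # concatenated in order, so the result is still aligned overall.
            wav.writeframesraw(chunk)
    return buffer.getvalue()
```

With `requests`, the chunks could come from `requests.get(audio_url, stream=True).iter_content(chunk_size=4096)`. For live playback the same chunks would be fed to an audio output instead of a buffer.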
GET /api/generate-stream/{stream_id}/result
Retrieves the finished synthesis payload (including fully-stitched base64-encoded audio) and chunk-to-timestamp range indexes once the job completes.
- Response Example (200 OK):
{
"stream_id": "job_a1b2c3d4",
"ready": true,
"state": "done",
"prompt_audio_path": "assets/audio/zh_1.wav",
"run_status": "Finished successfully.",
"stream_metrics": "state=done | emitted=4.50s",
"audio_chunk_ranges": [
[0.0, 4.5, 0]
],
"audio_base64": "UklGRi..."
}
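The layout of each `audio_chunk_ranges` entry is not spelled out here. The sketch below assumes it is `[start_seconds, end_seconds, chunk_index]`, which fits the single example entry `[0.0, 4.5, 0]` but is not confirmed. Under that assumption, a player could look up which text chunk is being spoken at a given playback time:

```python
def chunk_index_at(audio_chunk_ranges, seconds):
    """Return the chunk index whose [start, end) range contains `seconds`, or None.

    Assumes each entry is [start_seconds, end_seconds, chunk_index].
    """
    for start, end, index in audio_chunk_ranges:
        if start <= seconds < end:
            return index
    return None


ranges = [[0.0, 4.5, 0]]  # copied from the example response above
current = chunk_index_at(ranges, 2.0)
```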
POST /api/generate-stream/{stream_id}/close
Closes the active queue session, terminates background threads, and purges intermediate disk caches.
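Since closing releases server-side resources, a client may want to guarantee the call happens even when playback fails midway. One way to do that is a context manager, sketched below. `start_stream` and `close_stream` are hypothetical callables supplied by the caller, for example thin wrappers around the start and close endpoints.

```python
from contextlib import contextmanager


@contextmanager
def stream_session(start_stream, close_stream):
    """Start a stream, yield its start payload, and always close it afterwards."""
    payload = start_stream()
    try:
        yield payload
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        close_stream(payload["stream_id"])
```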
3. Metadata & Health Endpoints
GET /api/metadata
Returns information regarding loaded tokenizer weights, hardware attention defaults, and runtime modules.
- Response Example (200 OK):
{
"is_onnx": false,
"is_cpu_only": true,
"execution_provider": "cpu",
"device_type": "cpu",
"checkpoint_default_attn_implementation": "eager",
"checkpoint_path": "models/MOSS-TTS-Nano-100M",
"audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
"text_normalization_status": "WeTextProcessing ready."
}
GET /api/warmup-status
Returns model graph pre-warming status.
- Response Example (200 OK):
{
"ready": true,
"progress": 1.0,
"message": "Warmup completed successfully.",
"failed": false,
"status_text": "Warmup Ready"
}
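A client can gate its first synthesis request on this endpoint. The sketch below uses only the standard library and assumes the local server address from the introduction; `get_json` and `check_warmup` are illustrative names, and treating `failed` as fatal is an assumption.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:18083"  # assumed local server address


def get_json(path, base_url=BASE_URL, timeout=10):
    """GET a JSON endpoint such as /api/warmup-status and return the parsed body."""
    with urlopen(f"{base_url}{path}", timeout=timeout) as response:
        return json.load(response)


def check_warmup(status):
    """Return True when warmup is done, False while in progress; raise on failure."""
    if status.get("failed"):
        raise RuntimeError(status.get("message", "warmup failed"))
    return bool(status.get("ready"))
```

For instance, `check_warmup(get_json("/api/warmup-status"))` can be called in a loop with a short sleep until it returns `True`.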
GET /api/text-normalization-status
Returns Chinese/English front-end WeTextProcessing engine diagnostics.
4. Preset Voices
GET /api/demo-prompt-audio/{demo_id}
Streams a reference preset voice file directly.
- Parameters: demo_id (e.g. zh_1, en_3).
- Response: Binary audio file stream (audio/wav).
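After downloading a preset, it can be useful to check its format before reusing it as `prompt_audio`. The sketch below inspects WAV bytes with the standard `wave` module. `describe_wav` is a hypothetical helper, and `wave` only reads uncompressed PCM WAV files, so it raises `wave.Error` on other encodings.

```python
import io
import wave


def describe_wav(data):
    """Return (sample_rate, channels, duration_seconds) for PCM WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate
```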
5. Live Deployments
You can interact with a live deployment of this refactored, minimalist modular version here:
- Hugging Face Space (Refactored): Tom199328/MOSS-TTS-Nano