# MOSS-TTS-Nano API Documentation

This document describes the modular REST API endpoints exposed by MOSS-TTS-Nano. The API is built on FastAPI, and interactive OpenAPI (Swagger) docs are available at `http://127.0.0.1:18083/docs` while the server is running.

---

## 1. Synchronous Synthesis Endpoint

### `POST /api/generate`

Generates a complete synthesized audio file for a given text and reference prompt.

- **Request Type**: `multipart/form-data`
- **Parameters**:

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `text` | `string` | *Required* | The target text to be synthesized. |
| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
| `tts_max_batch_size` | `integer` | `0` | Batch size limit for TTS generation (0 = auto). |
| `codec_max_batch_size` | `integer` | `0` | Batch size limit for codec decoding (0 = auto). |
| `enable_text_normalization` | `string` | `"1"` | Enables the WeTextProcessing frontend (`"1"` or `"0"`). |
| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
| `cpu_threads` | `integer` | `0` | Target number of CPU threads. 0 uses the system settings. |
| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
| `seed` | `string` | `"0"` | Random seed (set to `"0"` or an empty string for random, non-deterministic output). |

- **Response Example (`200 OK`)**:

```json
{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
```
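
As a rough illustration of the request format above, the following Python sketch builds a `multipart/form-data` body with only the standard library, posts it to `/api/generate`, and decodes the `audio_base64` field. The helper names (`encode_multipart`, `generate`, `decode_audio`), the `BASE_URL` value (taken from the docs URL above), and the upload filename `prompt.wav` are illustrative assumptions, not part of the API. The `UklGRi...` prefix in the example response decodes to `RIFF`, which suggests a WAV payload, but that is inferred from the example rather than confirmed. A call might look like `payload = generate("Welcome to MOSS TTS.", demo_id="en_3")`, followed by writing `decode_audio(payload)` to a `.wav` file.

```python
import base64
import json
import urllib.request
import uuid

# Assumption: the server listens on the same host and port as the /docs URL.
BASE_URL = "http://127.0.0.1:18083"


def encode_multipart(fields, files=None):
    """Encode form fields and optional files as multipart/form-data.

    fields: dict mapping field name -> string value
    files:  dict mapping field name -> (filename, raw bytes, content type)
    Returns (body bytes, value for the Content-Type header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    for name, (filename, data, content_type) in (files or {}).items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def generate(text, demo_id="", prompt_audio=None, **overrides):
    """POST to /api/generate and return the parsed JSON response.

    prompt_audio: optional raw bytes of a reference audio file.
    overrides:    any other documented parameter, e.g. seed="42".
    """
    fields = {"text": text, "demo_id": demo_id}
    fields.update({key: str(value) for key, value in overrides.items()})
    files = None
    if prompt_audio is not None:
        # The filename is arbitrary here; only the field name is documented.
        files = {"prompt_audio": ("prompt.wav", prompt_audio, "audio/wav")}
    body, content_type = encode_multipart(fields, files)
    request = urllib.request.Request(
        f"{BASE_URL}/api/generate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def decode_audio(payload):
    """Return the raw audio bytes carried in the audio_base64 field."""
    return base64.b64decode(payload["audio_base64"])
```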
---

## 2. Asynchronous Streaming API

Streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio packets while the model is still generating.

### `POST /api/generate-stream/start`

Starts a background streaming generation job and registers a `stream_id`.

- **Request Type**: `multipart/form-data`
- **Parameters**: Same as `/api/generate`.
- **Response Example (`200 OK`)**:

```json
{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}
```
### `GET /api/generate-stream/{stream_id}/status`

Returns the current state of a streaming job, along with progress metrics and elapsed times. Intended to be polled while the job runs.

- **Response Example (`200 OK`)**:

```json
{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
```
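
One way to use this endpoint is a simple polling loop. The sketch below is a minimal example, assuming that `ready` becomes `true` once the job has finished (as the `/result` example suggests) and that `failed` signals an unrecoverable error. The function names, the `BASE_URL` value, the polling interval, and the timeout are all illustrative choices rather than documented behavior.

```python
import json
import time
import urllib.request

# Assumption: the server listens on the same host and port as the /docs URL.
BASE_URL = "http://127.0.0.1:18083"


def fetch_status(stream_id):
    """GET the status document for one streaming job."""
    url = f"{BASE_URL}/api/generate-stream/{stream_id}/status"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())


def wait_until_finished(stream_id, fetch=fetch_status, interval=0.5,
                        timeout=120.0, sleep=time.sleep):
    """Poll until the job reports ready, fails, or the timeout expires.

    fetch and sleep are injectable so the loop can be exercised
    without a running server.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(stream_id)
        if status.get("failed"):
            raise RuntimeError(
                f"stream {stream_id} failed: {status.get('status_text')}"
            )
        if status.get("ready"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"stream {stream_id} not ready after {timeout}s")
        sleep(interval)
```

Polling the status is optional when the client already consumes the `/audio` stream, since that stream ends on its own; the loop is mainly useful before fetching `/result`.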
### `GET /api/generate-stream/{stream_id}/audio`

Streams raw audio packets from the generation queue.

- **Response Type**: Chunked binary stream (`application/octet-stream`)
- **Format**: Raw PCM 16-bit Little-Endian (`pcm_s16le`)
- **Headers**:
  - `X-Audio-Codec`: `pcm_s16le`
  - `X-Audio-Sample-Rate`: `48000`
  - `X-Audio-Channels`: `1`
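
Because the stream is headerless PCM, a client has to add its own container before most players will open it. The following sketch reads the chunked response and wraps the samples in a WAV file using Python's standard `wave` module. The function names, `BASE_URL`, and the chunk size are assumptions for illustration. The default sample rate and channel count mirror the header values listed above, and a more careful client would read them from the response headers instead.

```python
import io
import urllib.request
import wave

# Assumption: the server listens on the same host and port as the /docs URL.
BASE_URL = "http://127.0.0.1:18083"


def iter_stream_audio(stream_id, chunk_size=4096):
    """Yield raw PCM chunks from the audio endpoint until the stream ends."""
    url = f"{BASE_URL}/api/generate-stream/{stream_id}/audio"
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk


def pcm_chunks_to_wav(chunks, sample_rate=48000, channels=1):
    """Wrap 16-bit little-endian PCM chunks in a WAV container.

    Chunks are joined first because a network read may split a
    two-byte sample across two chunks.
    """
    pcm = b"".join(chunks)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

For example, `open("out.wav", "wb").write(pcm_chunks_to_wav(iter_stream_audio(stream_id)))` would save the whole stream once it ends. A real-time player would instead hand each chunk to an audio device as it arrives.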
### `GET /api/generate-stream/{stream_id}/result`

Once the job completes, retrieves the finished synthesis payload, including the fully stitched base64-encoded audio and the timestamp range of each audio chunk.

- **Response Example (`200 OK`)**:

```json
{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [
    [0.0, 4.5, 0]
  ],
  "audio_base64": "UklGRi..."
}
```
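
The layout of `audio_chunk_ranges` is not spelled out here. A plausible reading of the example, treated purely as an assumption in the sketch below, is that each entry is `[start_seconds, end_seconds, chunk_index]`, where `chunk_index` points into the `text_chunks` list returned by `/api/generate-stream/start`. The helper name `describe_chunk_ranges` is hypothetical.

```python
def describe_chunk_ranges(audio_chunk_ranges, text_chunks):
    """Pair each audio range with the text chunk it is assumed to cover.

    Assumption: every entry is [start_seconds, end_seconds, chunk_index].
    Entries whose index falls outside text_chunks get text=None.
    """
    described = []
    for start, end, index in audio_chunk_ranges:
        index = int(index)
        in_range = 0 <= index < len(text_chunks)
        described.append(
            {
                "start": float(start),
                "end": float(end),
                "duration": float(end) - float(start),
                "text": text_chunks[index] if in_range else None,
            }
        )
    return described
```

Under that assumption, the result could drive features such as highlighting the sentence currently being played.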
### `POST /api/generate-stream/{stream_id}/close`

Closes the active queue session, terminates background threads, and purges intermediate disk caches.

---

## 3. Metadata & Health Endpoints

### `GET /api/metadata`

Returns information about the loaded model and audio tokenizer checkpoints, the execution device, and the default attention implementation.

- **Response Example (`200 OK`)**:

```json
{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}
```
### `GET /api/warmup-status`

Returns the model warm-up status.

- **Response Example (`200 OK`)**:

```json
{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}
```

### `GET /api/text-normalization-status`

Returns diagnostics for the Chinese/English WeTextProcessing front end.

---

## 4. Preset Voices

### `GET /api/demo-prompt-audio/{demo_id}`

Streams a preset reference voice file directly.

- **Parameters**: `demo_id` (e.g. `zh_1`, `en_3`).
- **Response**: Binary audio file stream (`audio/wav`).

---

## 5. Live Deployments

A live deployment of this refactored, modular version is available here:

- **Hugging Face Space (Refactored)**: [Tom199328/MOSS-TTS-Nano](https://huggingface.co/spaces/Tom199328/MOSS-TTS-Nano)