# MOSS-TTS-Nano API Documentation

This document describes the modular REST API endpoints exposed by MOSS-TTS-Nano. The API is powered by FastAPI, and interactive OpenAPI (Swagger UI) docs are available at `http://127.0.0.1:18083/docs` while the server is running.

---

## 1. Synchronous Synthesis Endpoint

### `POST /api/generate`

Generates a complete synthesized audio file for a given text and reference prompt.

- **Request Type**: `multipart/form-data`
- **Parameters**:

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `text` | `string` | *Required* | The target text to be synthesized. |
| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
| `tts_max_batch_size` | `integer` | `0` | Batch size constraint for TTS generation (`0` = auto). |
| `codec_max_batch_size` | `integer` | `0` | Batch size constraint for Codec decoding (`0` = auto). |
| `enable_text_normalization` | `string` | `"1"` | Enables the WeTextProcessing frontend (`"1"` or `"0"`). |
| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
| `cpu_threads` | `integer` | `0` | Number of CPU threads to use (`0` = system settings). |
| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
| `seed` | `string` | `"0"` | Random seed. `"0"` or an empty string gives random, non-deterministic output. |

- **Response Example (`200 OK`)**:

```json
{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
```

---

## 2. Asynchronous Streaming API

Streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio packets while the model is still generating.

### `POST /api/generate-stream/start`

Starts a background streaming generation job and registers a `stream_id`.

- **Request Type**: `multipart/form-data`
- **Parameters**: Same parameters as `/api/generate`.
- **Response Example (`200 OK`)**:

```json
{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}
```

### `GET /api/generate-stream/{stream_id}/status`

Returns the current state of a streaming job, including progress metrics and elapsed times. Intended for polling.

- **Response Example (`200 OK`)**:

```json
{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
```

### `GET /api/generate-stream/{stream_id}/audio`

Streams raw audio packets from the generation queue.

- **Response Type**: Chunked binary stream (`application/octet-stream`)
- **Format**: Raw PCM 16-bit Little-Endian (`pcm_s16le`)
- **Headers**:
  - `X-Audio-Codec`: `pcm_s16le`
  - `X-Audio-Sample-Rate`: `48000`
  - `X-Audio-Channels`: `1`

### `GET /api/generate-stream/{stream_id}/result`

Retrieves the finished synthesis payload (including fully-stitched base64-encoded audio) and chunk-to-timestamp range indexes once the job completes.

- **Response Example (`200 OK`)**:

```json
{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [[0.0, 4.5, 0]],
  "audio_base64": "UklGRi..."
}
```

### `POST /api/generate-stream/{stream_id}/close`

Closes the active queue session, terminates background threads, and purges intermediate disk caches.

---

## 3. Metadata & Health Endpoints

### `GET /api/metadata`

Returns information regarding loaded tokenizer weights, hardware attention defaults, and runtime modules.

- **Response Example (`200 OK`)**:

```json
{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}
```

### `GET /api/warmup-status`

Returns model graph pre-warming status.

- **Response Example (`200 OK`)**:

```json
{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}
```

### `GET /api/text-normalization-status`

Returns diagnostics for the Chinese/English WeTextProcessing front-end engine.

---

## 4. Preset Voices

### `GET /api/demo-prompt-audio/{demo_id}`

Streams a preset reference voice file directly.

- **Parameters**: `demo_id` (e.g. `zh_1`, `en_3`).
- **Response**: Binary audio file stream (`audio/wav`).

---

## 5. Live Deployments

You can interact with a live deployment of this refactored, minimalist modular version here:

- **Hugging Face Space (Refactored)**: [Tom199328/MOSS-TTS-Nano](https://huggingface.co/spaces/Tom199328/MOSS-TTS-Nano)
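
For client code that consumes the audio described in sections 1 and 2, the following is a minimal Python sketch of the decoding side only. It is not part of MOSS-TTS-Nano: the function names (`pcm_s16le_to_wav`, `pcm_duration_seconds`, `decode_audio_base64`) are hypothetical helpers, and fetching the bytes is left to whichever HTTP client you use. It assumes the stream matches the documented headers (`pcm_s16le`, 48000 Hz, 1 channel) and that `audio_base64` holds a complete WAV file, which the `UklGRi...` prefix in the examples suggests (that is how a base64-encoded RIFF header begins) but this document does not state outright.

```python
import base64
import io
import wave

BYTES_PER_SAMPLE = 2  # pcm_s16le: 16-bit samples


def pcm_s16le_to_wav(pcm: bytes, sample_rate: int = 48000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM bytes in a WAV container."""
    frame_size = BYTES_PER_SAMPLE * channels
    # A chunked stream can be cut mid-frame; drop any trailing partial frame.
    usable = len(pcm) - (len(pcm) % frame_size)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm[:usable])
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes, sample_rate: int = 48000, channels: int = 1) -> float:
    """Duration of the complete frames contained in a raw PCM buffer."""
    frames = len(pcm) // (BYTES_PER_SAMPLE * channels)
    return frames / sample_rate


def decode_audio_base64(payload: dict) -> bytes:
    """Decode the `audio_base64` field of a parsed JSON response.

    Assumption: the decoded bytes are a ready-to-save audio file.
    """
    return base64.b64decode(payload["audio_base64"])
```

A client would typically concatenate the chunks read from `/api/generate-stream/{stream_id}/audio`, pass the result to `pcm_s16le_to_wav` using the values from the `X-Audio-Sample-Rate` and `X-Audio-Channels` headers, and write the returned bytes to a `.wav` file. For `/api/generate` and `/result` responses, `decode_audio_base64` alone should be enough if the WAV assumption holds.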