	# MOSS-TTS-Nano API Documentation

	This document describes the modular REST API endpoints exposed by MOSS-TTS-Nano. The API is built on FastAPI; while the server is running, interactive OpenAPI (Swagger UI) docs are available at `http://127.0.0.1:18083/docs`.

	---

	## 1. Synchronous Synthesis Endpoint

	### `POST /api/generate`
	Generates a complete synthesized audio file for a given text and reference prompt.

	- Request Type: `multipart/form-data`
	- Parameters:

	| Parameter | Type | Default | Description |
	| :--- | :--- | :--- | :--- |
	| `text` | `string` | Required | The target text to be synthesized. |
	| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
	| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
	| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
	| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
	| `tts_max_batch_size` | `integer` | `0` | Batch size constraint for TTS generation (0 = auto). |
	| `codec_max_batch_size` | `integer` | `0` | Batch size constraint for Codec decoding (0 = auto). |
	| `enable_text_normalization` | `string` | `"1"` | Enables WeTextProcessing frontend (`"1"` or `"0"`). |
	| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
	| `cpu_threads` | `integer` | `0` | Target CPU threads. 0 leverages system settings. |
	| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
	| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
	| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
	| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
	| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
	| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
	| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
	| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
	| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
	| `seed` | `string` | `"0"` | Random seed (set to `"0"` or empty string for random/non-deterministic). |
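
The flag-style parameters above are typed as strings (`"1"` or `"0"`), so a client has to serialize booleans itself before building the `multipart/form-data` body. The following is a minimal sketch of one way to do that with only the Python standard library. `build_generate_form` and `encode_multipart` are hypothetical helper names, not part of MOSS-TTS-Nano, and the commented usage assumes a server listening on `127.0.0.1:18083`.

```python
import uuid


def build_generate_form(text, demo_id="", has_prompt_audio=False, **overrides):
    """Return the text form fields for POST /api/generate as strings.

    Hypothetical helper: booleans become "1"/"0", everything else str().
    """
    if not text.strip():
        raise ValueError("text is required")
    if not demo_id and not has_prompt_audio:
        raise ValueError("demo_id is required when prompt_audio is not uploaded")
    fields = {"text": text, "demo_id": demo_id}
    for name, value in overrides.items():
        if isinstance(value, bool):
            fields[name] = "1" if value else "0"
        else:
            fields[name] = str(value)
    return fields


def encode_multipart(fields, files=None):
    """Encode fields (and optional files) as multipart/form-data.

    files maps a field name to (filename, raw_bytes, content_type).
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    for name, (filename, data, ctype) in (files or {}).items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("ascii"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


# Usage sketch (not executed here; needs a running server):
#   import json
#   from urllib.request import Request, urlopen
#   fields = build_generate_form("Hello", demo_id="en_3", do_sample=False)
#   body, ctype = encode_multipart(fields)
#   req = Request("http://127.0.0.1:18083/api/generate", data=body,
#                 headers={"Content-Type": ctype}, method="POST")
#   payload = json.load(urlopen(req))
```

To clone a voice instead of using a preset, pass `has_prompt_audio=True` and supply the file through `files={"prompt_audio": ("ref.wav", wav_bytes, "audio/wav")}`.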

	- Response Example (`200 OK`):
	```json
	{
	  "audio_base64": "UklGRi...",
	  "sample_rate": 48000,
	  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
	  "prompt_audio_path": "assets/audio/zh_1.wav",
	  "warmup_status_text": "Model Ready",
	  "text_chunks": ["Welcome to MOSS TTS."],
	  "normalized_text": "welcome to moss tts",
	  "normalization_method": "wetext",
	  "text_normalization_language": "en"
	}
	```
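
The `audio_base64` field has to be decoded before it can be played or saved. The `UklGRi...` prefix in the example is the base64 form of `RIFF`, which suggests a WAV container. The sketch below relies on that assumption, and `decode_generate_response` and `wav_duration_seconds` are hypothetical helper names.

```python
import base64
import io
import wave


def decode_generate_response(payload):
    """Return (audio_bytes, sample_rate) from a /api/generate JSON payload."""
    audio_bytes = base64.b64decode(payload["audio_base64"])
    return audio_bytes, int(payload["sample_rate"])


def wav_duration_seconds(audio_bytes):
    """Return the duration of the decoded audio.

    Assumption: the payload is a RIFF/WAVE file, as the example prefix suggests.
    """
    if audio_bytes[:4] != b"RIFF":
        raise ValueError("payload does not start with a RIFF header")
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Writing `audio_bytes` to a file opened in binary mode should then give a playable `.wav` file, provided the container assumption holds.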

	---

	## 2. Asynchronous Streaming API

	Streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio while the model is still generating.

	### `POST /api/generate-stream/start`
	Starts a background streaming generation job and registers a `stream_id`.

	- Request Type: `multipart/form-data`
	- Parameters: Same parameters as `/api/generate`.
	- Response Example (`200 OK`):
	```json
	{
	  "stream_id": "job_a1b2c3d4",
	  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
	  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
	  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
	  "sample_rate": 48000,
	  "channels": 1,
	  "run_status": "Streaming realtime audio...",
	  "prompt_audio_path": "assets/audio/zh_1.wav",
	  "text_chunks": ["Welcome to the streaming engine."]
	}
	```
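
The URLs in the start response are server-relative paths, so a client needs to resolve them against the server's base URL before calling them. A minimal sketch follows. `resolve_stream_urls` is a hypothetical helper, and the `close_url` entry is built from the `/close` route documented in this section rather than returned by the server.

```python
from urllib.parse import urljoin


def resolve_stream_urls(base_url, start_payload):
    """Turn the relative URLs of a start response into absolute ones."""
    urls = {
        key: urljoin(base_url, start_payload[key])
        for key in ("audio_url", "status_url", "result_url")
    }
    # The start response has no close URL; derive it from the stream_id.
    stream_id = start_payload["stream_id"]
    urls["close_url"] = urljoin(base_url, f"/api/generate-stream/{stream_id}/close")
    return urls
```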

	### `GET /api/generate-stream/{stream_id}/status`
	Returns the current state, progress metrics, and timings of a streaming job. Poll it while the job is running.

	- Response Example (`200 OK`):
	```json
	{
	  "stream_id": "job_a1b2c3d4",
	  "state": "running",
	  "ready": false,
	  "failed": false,
	  "emitted_audio_seconds": 2.45,
	  "lead_seconds": 1.28,
	  "first_audio_latency_seconds": 0.45,
	  "status_text": "Streaming...",
	  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
	}
	```
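
A client that does not consume the audio stream directly can poll this endpoint until the job finishes. The sketch below shows one possible polling loop. It assumes `ready` becomes `true` on completion and `failed` becomes `true` on error, which the examples in this document suggest but do not state outright. `wait_for_stream` and its `fetch_status` callable are hypothetical; `fetch_status` stands in for whatever HTTP client performs the `GET`.

```python
import time


def wait_for_stream(fetch_status, poll_interval=0.25, timeout=60.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job is ready, failed, or timed out.

    fetch_status: zero-argument callable returning the status JSON as a dict.
    Returns the final status dict; raises RuntimeError or TimeoutError.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("failed"):
            raise RuntimeError(status.get("status_text", "stream failed"))
        if status.get("ready"):
            return status
        if clock() >= deadline:
            raise TimeoutError("stream did not finish before the timeout")
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays.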

	### `GET /api/generate-stream/{stream_id}/audio`
	Streams raw audio packets from the generation queue.

	- Response Type: Chunked binary stream (`application/octet-stream`)
	- Format: Raw PCM 16-bit Little-Endian (`pcm_s16le`)
	- Headers:
	  - `X-Audio-Codec`: `pcm_s16le`
	  - `X-Audio-Sample-Rate`: `48000`
	  - `X-Audio-Channels`: `1`
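
Because the stream is headerless PCM, a client has to take the sample rate and channel count from the response headers, and network chunks may split a 16-bit sample in two. The sketch below shows one way to collect chunks and wrap them in a WAV container using the standard `wave` module. `audio_format_from_headers` and `pcm_chunks_to_wav` are hypothetical helpers, and the fallback values are simply the ones listed above.

```python
import io
import wave


def audio_format_from_headers(headers):
    """Read (sample_rate, channels) from the X-Audio-* response headers."""
    codec = headers.get("X-Audio-Codec", "pcm_s16le")
    if codec != "pcm_s16le":
        raise ValueError(f"unsupported codec: {codec}")
    sample_rate = int(headers.get("X-Audio-Sample-Rate", 48000))
    channels = int(headers.get("X-Audio-Channels", 1))
    return sample_rate, channels


def pcm_chunks_to_wav(chunks, sample_rate=48000, channels=1):
    """Join raw pcm_s16le chunks and return them as WAV file bytes."""
    frame_size = 2 * channels  # 16-bit samples: 2 bytes per channel
    # Joining first makes chunk boundaries irrelevant, even mid-sample ones.
    pcm = b"".join(chunks)
    # Drop a trailing partial frame, if the stream was cut short.
    usable = len(pcm) - (len(pcm) % frame_size)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        # wave takes native-endian frames; on little-endian hosts (x86, ARM)
        # pcm_s16le data passes through unchanged.
        wav.writeframes(pcm[:usable])
    return buf.getvalue()
```

For live playback, the same chunks would instead be fed to an audio output library as they arrive; buffering into a WAV file is only the simplest way to check the stream.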

	### `GET /api/generate-stream/{stream_id}/result`
	Retrieves the finished synthesis payload once the job completes, including the fully stitched base64-encoded audio and the timestamp range of each chunk.

	- Response Example (`200 OK`):
	```json
	{
	  "stream_id": "job_a1b2c3d4",
	  "ready": true,
	  "state": "done",
	  "prompt_audio_path": "assets/audio/zh_1.wav",
	  "run_status": "Finished successfully.",
	  "stream_metrics": "state=done | emitted=4.50s",
	  "audio_chunk_ranges": [
	    [0.0, 4.5, 0]
	  ],
	  "audio_base64": "UklGRi..."
	}
	```
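
The layout of `audio_chunk_ranges` is not spelled out in this document. The sketch below assumes each entry is `[start_seconds, end_seconds, chunk_index]`, which fits the single example `[0.0, 4.5, 0]` but is unconfirmed. Under that assumption, a player could use the ranges to highlight the text chunk being spoken. `chunk_index_at` and `total_duration` are hypothetical helpers.

```python
def chunk_index_at(audio_chunk_ranges, t):
    """Return the chunk index whose range contains time t, or None.

    Assumes entries of the form [start_seconds, end_seconds, chunk_index].
    """
    for start, end, index in audio_chunk_ranges:
        if start <= t < end:
            return index
    return None


def total_duration(audio_chunk_ranges):
    """Return the latest end time across all ranges (0.0 if empty)."""
    return max((end for _, end, _ in audio_chunk_ranges), default=0.0)
```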

	### `POST /api/generate-stream/{stream_id}/close`
	Closes the active queue session, terminates background threads, and purges intermediate disk caches.
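
Since closing releases server-side resources, a client may want to guarantee the call even when playback raises an error. One way to sketch this is a context manager. `stream_session`, `start_stream`, and `close_stream` are hypothetical names; the two callables stand in for the HTTP calls to the `/start` and `/close` endpoints.

```python
from contextlib import contextmanager


@contextmanager
def stream_session(start_stream, close_stream):
    """Start a streaming job and always close it afterwards.

    start_stream: zero-argument callable returning the start response dict.
    close_stream: callable taking the stream_id to close.
    """
    payload = start_stream()
    try:
        yield payload
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        close_stream(payload["stream_id"])
```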

	---

	## 3. Metadata & Health Endpoints

	### `GET /api/metadata`
	Returns information about the loaded checkpoint and audio tokenizer, the execution device, the checkpoint's default attention implementation, and the text-normalization status.

	- Response Example (`200 OK`):
	```json
	{
	  "is_onnx": false,
	  "is_cpu_only": true,
	  "execution_provider": "cpu",
	  "device_type": "cpu",
	  "checkpoint_default_attn_implementation": "eager",
	  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
	  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
	  "text_normalization_status": "WeTextProcessing ready."
	}
	```

	### `GET /api/warmup-status`
	Returns the status of model warmup (pre-warming of the model graph).

	- Response Example (`200 OK`):
	```json
	{
	  "ready": true,
	  "progress": 1.0,
	  "message": "Warmup completed successfully.",
	  "failed": false,
	  "status_text": "Warmup Ready"
	}
	```
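
A client can check this endpoint before sending its first synthesis request. The following sketch fetches a JSON endpoint with the standard library and condenses the warmup payload into a one-line summary. `get_json` and `warmup_summary` are hypothetical helpers, and treating `progress` as a fraction between 0 and 1 is an assumption based on the `1.0` in the example.

```python
import json
from urllib.request import urlopen


def get_json(url, timeout=5.0):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def warmup_summary(status):
    """Condense a /api/warmup-status payload into a short string."""
    if status.get("failed"):
        return f"failed: {status.get('message', 'unknown error')}"
    if status.get("ready"):
        return "ready"
    # Assumption: progress is a 0.0-1.0 fraction.
    return f"warming up ({status.get('progress', 0.0):.0%})"


# Usage sketch (not executed here; needs a running server):
#   print(warmup_summary(get_json("http://127.0.0.1:18083/api/warmup-status")))
```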

	### `GET /api/text-normalization-status`
	Returns diagnostics for the Chinese/English WeTextProcessing text-normalization front end.

	---

	## 4. Preset Voices

	### `GET /api/demo-prompt-audio/{demo_id}`
	Streams a reference preset voice file directly.

	- Parameters: `demo_id` (e.g. `zh_1`, `en_3`).
	- Response: Binary audio file stream (`audio/wav`).
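
As a small illustration, the URL for a preset voice can be built by substituting the `demo_id` into the route above. `demo_prompt_audio_url` is a hypothetical helper; percent-encoding the identifier is a defensive choice, not something this document requires.

```python
from urllib.parse import quote, urljoin


def demo_prompt_audio_url(base_url, demo_id):
    """Build the absolute URL of a preset voice file."""
    if not demo_id:
        raise ValueError("demo_id must be non-empty")
    # safe="" also encodes "/" so the id cannot add path segments.
    return urljoin(base_url, "/api/demo-prompt-audio/" + quote(demo_id, safe=""))
```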

	---

	## 5. Live Deployments

	A live deployment of this refactored, modular version is available here:
	- Hugging Face Space (Refactored): [Tom199328/MOSS-TTS-Nano](https://huggingface.co/spaces/Tom199328/MOSS-TTS-Nano)