MOSS-TTS-Nano API Documentation

This document describes the modular REST API endpoints exposed by MOSS-TTS-Nano. The API is built with FastAPI, and interactive OpenAPI (Swagger UI) documentation is available at http://127.0.0.1:18083/docs while the server is running.

1. Synchronous Synthesis Endpoint

`POST /api/generate`

Generates a complete synthesized audio file for a given text and reference prompt.

Request Type: multipart/form-data
Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | `string` | Required | The target text to be synthesized. |
| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
| `tts_max_batch_size` | `integer` | `0` | Batch size constraint for TTS generation (0 = auto). |
| `codec_max_batch_size` | `integer` | `0` | Batch size constraint for Codec decoding (0 = auto). |
| `enable_text_normalization` | `string` | `"1"` | Enables WeTextProcessing frontend (`"1"` or `"0"`). |
| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
| `cpu_threads` | `integer` | `0` | Target CPU threads. 0 leverages system settings. |
| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
| `seed` | `string` | `"0"` | Random seed (set to `"0"` or empty string for random/non-deterministic). |
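Because the request is `multipart/form-data`, every parameter travels as a form field, and the on/off flags are sent as the strings `"1"` and `"0"`. The following is a minimal sketch of building such a request with only the Python standard library. The helper names (`form_fields`, `encode_multipart`, `build_generate_request`) are illustrative and not part of the API, and the base URL is assumed to match the host and port of the docs URL given earlier.

```python
import uuid
import urllib.request

BASE_URL = "http://127.0.0.1:18083"  # assumed: same host/port as the /docs URL


def form_fields(text, demo_id="", **options):
    """Convert Python values into the string form fields the table describes."""
    fields = {"text": text, "demo_id": demo_id}
    for name, value in options.items():
        if isinstance(value, bool):
            value = "1" if value else "0"  # flags are sent as "1" / "0"
        fields[name] = str(value)
    return fields


def encode_multipart(fields, files=None):
    """Encode fields (and optional files) as a multipart/form-data body.

    files maps a field name to (filename, raw_bytes, content_type).
    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    for name, (filename, data, ctype) in (files or {}).items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: {ctype}\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_generate_request(fields, files=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/generate."""
    body, content_type = encode_multipart(fields, files)
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen` and parse the response body with `json.load`. For voice cloning, supply the reference audio under the `prompt_audio` key of `files`.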

Response Example (200 OK):

{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
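The `audio_base64` value in the example begins with `UklGRi`, which is how base64 renders bytes starting with `RIFF`, so the payload appears to be a WAV file. Assuming that holds, and that the file uses plain PCM encoding (the standard `wave` module reads nothing else), a response can be decoded and inspected as sketched below. The function names are illustrative only.

```python
import base64
import io
import wave


def decode_audio(response):
    """Return the raw bytes held in the response's audio_base64 field."""
    return base64.b64decode(response["audio_base64"])


def wav_info(wav_bytes):
    """Read sample rate, channel count, and duration from PCM WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_seconds": frames / rate,
        }


def save_audio(response, path):
    """Decode the audio and write it to path; returns the byte count."""
    data = decode_audio(response)
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)
```

Comparing `wav_info(...)["sample_rate"]` with the response's own `sample_rate` field is a cheap sanity check before playback.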

2. Asynchronous Streaming API

The streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio while the model is still generating.

`POST /api/generate-stream/start`

Starts a background streaming generation job and returns its `stream_id`, together with the URLs of the job's audio, status, and result endpoints.

Request Type: multipart/form-data
Parameters: Same as `/api/generate`.
Response Example (200 OK):

{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}

`GET /api/generate-stream/{stream_id}/status`

Returns the current state of a streaming job, including progress metrics and latency timings. Clients poll this endpoint while the job runs.

Response Example (200 OK):

{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
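A client typically calls the status endpoint in a loop until the job finishes. The sketch below assumes that `ready` turns `true` once the job is done and that `failed` signals an unrecoverable error; both readings are inferred from the example payloads rather than stated by the API. The HTTP call is injected as `fetch_status` (a hypothetical callable) so the loop stays independent of any HTTP library.

```python
import time


def wait_for_stream(fetch_status, interval=0.25, timeout=120.0, sleep=time.sleep):
    """Poll until the job reports ready or failed.

    fetch_status is any zero-argument callable returning the parsed status
    JSON, for example a function that GETs the job's status_url.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("failed"):
            raise RuntimeError(f"stream failed: {status.get('status_text', '')}")
        if status.get("ready"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"stream still {status.get('state')!r} after {timeout}s")
        sleep(interval)  # back off before the next poll
```

The interval and timeout defaults are arbitrary; shorter intervals give fresher metrics at the cost of more requests.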

`GET /api/generate-stream/{stream_id}/audio`

Streams raw audio packets from the generation queue.

Response Type: Chunked binary stream (application/octet-stream)
Format: Raw PCM 16-bit Little-Endian (pcm_s16le)
Headers:
- X-Audio-Codec: pcm_s16le
- X-Audio-Sample-Rate: 48000
- X-Audio-Channels: 1
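Raw `pcm_s16le` has no container, so most players need it wrapped before they can open it. The sketch below collects streamed chunks into a WAV file using the standard `wave` module. It buffers leftover bytes between chunks because an HTTP read may end in the middle of a 16-bit sample. The function name is illustrative, and the default sample rate and channel count simply mirror the header values shown above.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate=48000, channels=1):
    """Wrap raw pcm_s16le chunks in a WAV container and return the bytes.

    chunks is any iterable of bytes objects, e.g. successive reads from the
    HTTP response body.
    """
    frame_size = 2 * channels  # 16-bit samples: 2 bytes per channel
    pending = b""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            pending += chunk
            usable = len(pending) - len(pending) % frame_size
            wav.writeframes(pending[:usable])
            pending = pending[usable:]  # keep a partial frame for the next chunk
    return out.getvalue()
```

In practice the values of `X-Audio-Sample-Rate` and `X-Audio-Channels` from the response should be passed in rather than relying on the defaults.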

`GET /api/generate-stream/{stream_id}/result`

Retrieves the finished synthesis payload once the job completes, including the fully stitched base64-encoded audio (`audio_base64`) and the per-chunk timestamp ranges (`audio_chunk_ranges`).

Response Example (200 OK):

{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [
    [0.0, 4.5, 0]
  ],
  "audio_base64": "UklGRi..."
}

`POST /api/generate-stream/{stream_id}/close`

Closes the active queue session, terminates background threads, and purges intermediate disk caches.
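Since closing is what releases the server-side job, it is worth calling even when the client fails partway through. One way to guarantee that, sketched here under the assumption that the URLs returned by the start endpoint are relative to the server root, is a context manager. `stream_session` and its `post` argument are hypothetical names; `post` stands for any callable that sends a POST to the URL it receives.

```python
from contextlib import contextmanager
from urllib.parse import quote, urljoin


@contextmanager
def stream_session(base_url, start_info, post):
    """Yield absolute URLs for a started job, then always close the job.

    start_info is the parsed JSON from /api/generate-stream/start; post is
    any callable that sends a POST request to the URL it is given.
    """
    urls = {
        key: urljoin(base_url, start_info[key])
        for key in ("audio_url", "status_url", "result_url")
    }
    stream_id = quote(start_info["stream_id"], safe="")
    close_url = urljoin(base_url, f"/api/generate-stream/{stream_id}/close")
    try:
        yield urls
    finally:
        post(close_url)  # runs even if the body of the with-block raised
```

Inside the `with` block the caller reads from `urls["audio_url"]` and, once the job is done, fetches `urls["result_url"]`.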

3. Metadata & Health Endpoints

`GET /api/metadata`

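Returns information about the running model: the checkpoint and audio tokenizer paths, the execution provider and device, the checkpoint's default attention implementation, and the text normalization status.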

Response Example (200 OK):

{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}

`GET /api/warmup-status`

Returns model graph pre-warming status.

Response Example (200 OK):

{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}

`GET /api/text-normalization-status`

Returns Chinese/English front-end WeTextProcessing engine diagnostics.

4. Preset Voices

`GET /api/demo-prompt-audio/{demo_id}`

Streams a reference preset voice file directly.

Parameters: demo_id (e.g. zh_1, en_3).
Response: Binary audio file stream (audio/wav).
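Fetching a preset is a convenient way to audition a voice before synthesizing with its `demo_id`. The sketch below streams the file to disk with the standard library. The helper names are illustrative, the base URL is assumed to be the local server address used for the docs page, and `opener` exists only so the network call can be swapped out.

```python
import shutil
import urllib.request
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:18083"  # assumed local server address


def demo_prompt_url(demo_id, base_url=BASE_URL):
    """Build the preset-audio URL, percent-encoding the identifier."""
    return f"{base_url}/api/demo-prompt-audio/{quote(demo_id, safe='')}"


def download_demo_prompt(demo_id, dest_path, opener=urllib.request.urlopen):
    """Stream the preset WAV to dest_path without loading it all into memory."""
    with opener(demo_prompt_url(demo_id)) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

For example, `download_demo_prompt("zh_1", "zh_1.wav")` would save the preset named in the parameter example above, provided the server is running.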

5. Live Deployments

You can interact with a live deployment of this refactored, minimalist modular version here:

Hugging Face Space (Refactored): Tom199328/MOSS-TTS-Nano