MOSS-TTS-Nano API Documentation

This document describes the REST API endpoints exposed by MOSS-TTS-Nano. The API is built on FastAPI. While the server is running, interactive OpenAPI (Swagger) docs are available at http://127.0.0.1:18083/docs.


1. Synchronous Synthesis Endpoint

POST /api/generate

Generates a complete synthesized audio file for a given text and reference prompt.

  • Request Type: multipart/form-data
  • Parameters:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | Required | The target text to be synthesized. |
| demo_id | string | "" | The identifier of a preset voice (e.g. zh_1, en_3). Required if prompt_audio is not provided. |
| prompt_audio | file | None | Reference audio file for zero-shot voice cloning. Overrides demo_id if uploaded. |
| max_new_frames | integer | 375 | Maximum number of new frames generated by the model. |
| voice_clone_max_text_tokens | integer | 75 | Maximum context length for text-based voice cloning. |
| tts_max_batch_size | integer | 0 | Batch size constraint for TTS generation (0 = auto). |
| codec_max_batch_size | integer | 0 | Batch size constraint for Codec decoding (0 = auto). |
| enable_text_normalization | string | "1" | Enables WeTextProcessing frontend ("1" or "0"). |
| enable_normalize_tts_text | string | "1" | Enables secondary robust text normalization ("1" or "0"). |
| cpu_threads | integer | 0 | Target CPU threads. 0 leverages system settings. |
| attn_implementation | string | "model_default" | Attention implementation (e.g. "model_default", "sdpa", "eager"). |
| do_sample | string | "1" | Whether to use sampling during decoding ("1" or "0"). |
| text_temperature | float | 1.0 | Sampling temperature for text tokens. |
| text_top_p | float | 1.0 | Top-p filtering threshold for text. |
| text_top_k | integer | 50 | Top-k filtering threshold for text. |
| audio_temperature | float | 0.8 | Sampling temperature for audio tokens. |
| audio_top_p | float | 0.95 | Top-p filtering threshold for audio. |
| audio_top_k | integer | 25 | Top-k filtering threshold for audio. |
| audio_repetition_penalty | float | 1.2 | Repetition penalty scaling factor for audio generation. |
| seed | string | "0" | Random seed (set to "0" or empty string for random/non-deterministic). |
  • Response Example (200 OK):
{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
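The request above could be issued from Python roughly as follows. This is a minimal sketch using only the standard library, assuming the server listens at http://127.0.0.1:18083 (the address given for the docs page). The helper names `to_form_fields`, `encode_multipart`, `generate`, and `decode_audio` are hypothetical and not part of the API. The sketch also assumes that `audio_base64` carries a complete WAV file, which the example prefix suggests ("UklGR" is the base64 encoding of "RIFF") but the document does not state.

```python
import base64
import json
import urllib.request
import uuid

BASE_URL = "http://127.0.0.1:18083"  # assumption: default local server address


def to_form_fields(params: dict) -> dict[str, str]:
    """Convert Python values to form strings; booleans become "1"/"0"."""
    fields = {}
    for name, value in params.items():
        if isinstance(value, bool):
            fields[name] = "1" if value else "0"
        else:
            fields[name] = str(value)
    return fields


def encode_multipart(fields: dict[str, str]) -> tuple[bytes, str]:
    """Encode text-only fields as multipart/form-data (no file parts)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    content_type = f"multipart/form-data; boundary={boundary}"
    return "".join(parts).encode("utf-8"), content_type


def generate(text: str, demo_id: str, **options) -> dict:
    """POST /api/generate and return the parsed JSON response."""
    fields = to_form_fields({"text": text, "demo_id": demo_id, **options})
    body, content_type = encode_multipart(fields)
    request = urllib.request.Request(
        f"{BASE_URL}/api/generate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def decode_audio(payload: dict) -> bytes:
    """Decode the audio_base64 field into raw file bytes."""
    return base64.b64decode(payload["audio_base64"])
```

For example, calling `generate("Welcome to MOSS TTS.", "en_3", do_sample=True)` and writing `decode_audio(payload)` to a `.wav` file would save the result. Uploading `prompt_audio` would need a file part, which this text-only encoder does not build.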

2. Asynchronous Streaming API

Streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio packets while the model is still generating.

POST /api/generate-stream/start

Starts a background streaming generation job and registers a stream_id.

  • Request Type: multipart/form-data
  • Parameters: Same parameters as /api/generate.
  • Response Example (200 OK):
{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}
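The URLs in this response are relative paths, so a client has to join them with the server address before use. A small sketch of that step follows, assuming the server runs at http://127.0.0.1:18083. The helper name `resolve_stream_urls` is hypothetical, and the `close_url` entry is derived from the documented `/close` route rather than returned by the server.

```python
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:18083"  # assumption: default local server address


def resolve_stream_urls(start_response: dict, base_url: str = BASE_URL) -> dict[str, str]:
    """Return absolute URLs for the endpoints of one streaming job."""
    urls = {
        key: urljoin(base_url, start_response[key])
        for key in ("audio_url", "status_url", "result_url")
    }
    # The close route is not part of the start response; build it from stream_id.
    stream_id = start_response["stream_id"]
    urls["close_url"] = urljoin(base_url, f"/api/generate-stream/{stream_id}/close")
    return urls


# Trimmed copy of the example response shown above.
start_response = {
    "stream_id": "job_a1b2c3d4",
    "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
    "status_url": "/api/generate-stream/job_a1b2c3d4/status",
    "result_url": "/api/generate-stream/job_a1b2c3d4/result",
}
urls = resolve_stream_urls(start_response)
```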

GET /api/generate-stream/{stream_id}/status

Returns the current state of a streaming job, along with progress metrics and timing information. Clients typically poll this endpoint while the job runs.

  • Response Example (200 OK):
{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
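A polling loop over this endpoint might look like the sketch below. It assumes that `ready` becomes true once the job has finished and that `failed` signals an unrecoverable error; the field names come from the example, but this interpretation is an assumption. The function name `wait_for_stream` and the injected `fetch_status` callable are hypothetical, and the HTTP call is left to the caller so the loop stays independent of any HTTP library.

```python
import time
from typing import Callable


def wait_for_stream(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 0.25,
    timeout_seconds: float = 60.0,
) -> dict:
    """Poll until the status reports ready, failure, or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("failed"):
            raise RuntimeError(f"stream failed: {status.get('status_text', '')}")
        if status.get("ready"):  # assumption: ready=true means the job is done
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("stream did not finish before the timeout")
        time.sleep(interval_seconds)
```

In practice `fetch_status` would issue `GET /api/generate-stream/{stream_id}/status` and return the parsed JSON body.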

GET /api/generate-stream/{stream_id}/audio

Streams raw audio packets from the generation queue.

  • Response Type: Chunked binary stream (application/octet-stream)
  • Format: Raw PCM 16-bit Little-Endian (pcm_s16le)
  • Headers:
    • X-Audio-Codec: pcm_s16le
    • X-Audio-Sample-Rate: 48000
    • X-Audio-Channels: 1
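Because the stream is headerless PCM, a client needs the sample rate, channel count, and sample width to interpret it. The sketch below shows how received chunks could be measured and wrapped in a WAV container with Python's `wave` module. The helper names are hypothetical, and the defaults mirror the header values listed above (48000 Hz, 1 channel, 16-bit samples).

```python
import io
import wave
from typing import Iterable

SAMPLE_WIDTH_BYTES = 2  # pcm_s16le uses 16 bits (2 bytes) per sample


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 48000, channels: int = 1) -> float:
    """Return the playback duration of a raw PCM buffer of the given size."""
    return num_bytes / (SAMPLE_WIDTH_BYTES * channels * sample_rate)


def pcm_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int = 48000, channels: int = 1) -> bytes:
    """Concatenate raw PCM chunks and wrap them in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Chunks are appended in arrival order; the header is fixed up on close.
            wav_file.writeframesraw(chunk)
    return buffer.getvalue()
```

With a `urllib.request.urlopen` response object, `iter(lambda: response.read(4096), b"")` would yield chunks suitable for `pcm_chunks_to_wav`. Reading the actual rate and channel count from the `X-Audio-*` headers is safer than relying on the defaults.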

GET /api/generate-stream/{stream_id}/result

Retrieves the finished synthesis payload once the job completes, including the fully stitched base64-encoded audio and the chunk-to-timestamp ranges.

  • Response Example (200 OK):
{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [
    [0.0, 4.5, 0]
  ],
  "audio_base64": "UklGRi..."
}
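The document does not define the layout of `audio_chunk_ranges`. The sketch below assumes each entry is `[start_seconds, end_seconds, chunk_index]`, which is consistent with the example (`[0.0, 4.5, 0]` alongside `emitted=4.50s`) but not confirmed. Under that assumption, a player could look up which text chunk is audible at a given playback position. The function name `chunk_index_at` is hypothetical.

```python
from typing import Optional, Sequence


def chunk_index_at(ranges: Sequence[Sequence[float]], position_seconds: float) -> Optional[int]:
    """Return the chunk index covering a playback position, or None if out of range.

    Assumes each range is [start_seconds, end_seconds, chunk_index].
    """
    for start, end, index in ranges:
        if start <= position_seconds < end:
            return int(index)
    return None
```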

POST /api/generate-stream/{stream_id}/close

Closes the active queue session, terminates background threads, and purges intermediate disk caches.
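Since closing releases server-side threads and caches, a client will probably want to call it even when playback fails partway through. One way to guarantee that is a context manager, sketched below. The name `stream_session` and the injected `start` and `close` callables are hypothetical; in a real client they would wrap `POST /api/generate-stream/start` and `POST /api/generate-stream/{stream_id}/close`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def stream_session(
    start: Callable[[], dict],
    close: Callable[[str], None],
) -> Iterator[dict]:
    """Start a streaming job and always close it, even if the body raises."""
    info = start()
    try:
        yield info
    finally:
        close(info["stream_id"])
```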


3. Metadata & Health Endpoints

GET /api/metadata

Returns information regarding loaded tokenizer weights, hardware attention defaults, and runtime modules.

  • Response Example (200 OK):
{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}

GET /api/warmup-status

Returns model graph pre-warming status.

  • Response Example (200 OK):
{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}
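A client could check this endpoint before sending its first synthesis request. The sketch below assumes the server address http://127.0.0.1:18083 and treats `failed: true` as a fatal condition. The helper names `fetch_warmup_status` and `is_warmed_up` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18083"  # assumption: default local server address


def fetch_warmup_status(base_url: str = BASE_URL) -> dict:
    """GET /api/warmup-status and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/warmup-status") as response:
        return json.load(response)


def is_warmed_up(status: dict) -> bool:
    """Return True when warmup has finished; raise if warmup failed."""
    if status.get("failed"):
        raise RuntimeError(f"warmup failed: {status.get('message', '')}")
    return bool(status.get("ready"))
```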

GET /api/text-normalization-status

Returns diagnostics for the WeTextProcessing text-normalization front end (Chinese/English).


4. Preset Voices

GET /api/demo-prompt-audio/{demo_id}

Streams a reference preset voice file directly.

  • Parameters: demo_id (e.g. zh_1, en_3).
  • Response: Binary audio file stream (audio/wav).
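After downloading a preset, it can be useful to inspect its format, for instance before reusing it as `prompt_audio`. The sketch below builds the endpoint URL and reads basic properties with Python's `wave` module. It assumes the server address http://127.0.0.1:18083 and that the file is uncompressed PCM WAV, which is the only kind the `wave` module reads. The helper names are hypothetical.

```python
import io
import wave
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:18083"  # assumption: default local server address


def demo_prompt_url(demo_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for a preset voice, escaping the identifier."""
    return f"{base_url}/api/demo-prompt-audio/{quote(demo_id, safe='')}"


def describe_wav(data: bytes) -> dict:
    """Return basic format information for an in-memory PCM WAV file."""
    with wave.open(io.BytesIO(data), "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate": rate,
            "duration_seconds": frames / rate,
        }
```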

5. Live Deployments

You can interact with a live deployment of this refactored, minimalist modular version here: