license: apache-2.0
tags:
  - text-to-speech
  - audio
  - cpu-optimized
language:
  - zh
  - en
  - de
  - es
  - fr
  - ja
  - it
  - he
  - ko
  - ru
  - fa
  - ar
  - pl
  - pt
  - cs
  - da
  - sv
  - hu
  - el
  - tr
library_name: transformers

MOSS-TTS (CPU Optimized)

🚀 CPU Optimized Version: This repository contains a build of MOSS-TTS optimized for fast inference in CPU-only environments.

The optimization and packaging were performed by NEO, an autonomous ML engineering agent.

Overview

MOSS-TTS is a family of speech and sound generation models designed for high-fidelity, highly expressive synthesis in complex real-world scenarios. This version uses runtime dynamic quantization and specific architectural configurations to deliver low-latency speech synthesis without requiring a GPU.

Key Optimizations by NEO:

  • Dynamic INT8 Quantization: Reduces memory footprint and accelerates inference on modern CPUs.
  • Thread Scaling: Configured for optimal multi-threaded performance.
  • CPU-Friendly Tensors: All weights and buffers are prepared for FP32/INT8 execution paths.
  • Autonomous Validation: Verified functionality in resource-constrained environments.
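
As a rough illustration of the first two items above, the sketch below applies PyTorch's dynamic INT8 quantization to a small stand-in network and sets the intra-op thread count. It is not the procedure NEO used for this repository: `TinyNet`, `quantize_for_cpu`, and the thread count are hypothetical names and values chosen for illustration, and the sketch assumes a PyTorch build with a quantized CPU backend available.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Toy stand-in with Linear layers; NOT the MOSS-TTS architecture."""

    def __init__(self, dim: int = 64) -> None:
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))


def quantize_for_cpu(model: nn.Module, num_threads: int = 4) -> nn.Module:
    """Hypothetical helper: set the thread count and quantize Linear layers."""
    # Number of threads used for intra-op parallelism on the CPU.
    torch.set_num_threads(num_threads)
    model.eval()
    # Weights are stored as INT8; activations are quantized on the fly at
    # runtime. Returns a converted copy and leaves the input model unchanged.
    return torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )


fp32_model = TinyNet()
int8_model = quantize_for_cpu(fp32_model, num_threads=2)

x = torch.randn(1, 64)
with torch.inference_mode():
    y = int8_model(x)
print(type(int8_model.fc1).__module__, tuple(y.shape))
```

Dynamic quantization only converts the layer types passed in the set, so a model dominated by other layer types would see a smaller benefit.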

🛠 Usage

Installation

pip install transformers torch torchaudio

Quick Start

from transformers import AutoModel, AutoProcessor
import torch

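# Load the CPU-optimized model and its processor.
# trust_remote_code=True executes custom code shipped in the model repository.
model_name = "daksh-neo/MOSS-TTS"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float32,  # load weights in FP32 for CPU execution
)

# Inference (example)
text = "This is CPU-optimized speech synthesis by NEO."
inputs = processor(text=[text], mode="generation")
with torch.inference_mode():  # no gradient tracking during generation
    outputs = model.generate(**inputs)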
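
To check latency on your own CPU, a small timing helper along these lines may be useful. It is a minimal sketch using only the standard library: `median_latency` and `real_time_factor` are hypothetical helper names, and the workload shown is a placeholder that you would replace with your own call, for example a lambda wrapping `model.generate(**inputs)`.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> float:
    """Return the median wall-clock time, in seconds, of calling fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Placeholder workload; swap in your own synthesis call here.
latency = median_latency(lambda: sum(range(100_000)))
print(f"median latency: {latency:.4f} s")
```

The median is used rather than the mean so that a single slow run, such as one affected by a cold cache, does not dominate the result.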

📊 Capabilities

  • Zero-shot Voice Cloning: Clone voices from short reference clips.
  • Multilingual Support: High-quality synthesis across 20+ languages.
  • Long-form Stability: Synthesize stable audio for durations up to 1 hour.
  • Fine-grained Control: Phoneme-level and duration-level control for precise prosody.

🏗 Architecture

This specific export is based on the MossTTSDelay architecture, optimized for sequential stability and CPU throughput.

| Feature | Specification |
|---|---|
| Optimization Engine | NEO (Autonomous ML Agent) |
| Device Target | CPU (x86_64 / ARM64) |
| Quantization | Dynamic INT8 |
| Sampling Rate | 24 kHz / 44.1 kHz (configurable) |
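
The sampling rate matters when you save generated audio, because the rate written to the file header must match the rate the samples were produced at. The sketch below writes a mono signal as a 16-bit PCM WAV file using only the standard library. It assumes you already have a one-dimensional sequence of float samples in [-1.0, 1.0]; the output format of this model's `generate` call is not documented here, so the sine tone is a placeholder signal and `write_wav` is a hypothetical helper.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Sequence


def write_wav(path: str, samples: Sequence[float], sample_rate: int = 24_000) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range values
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)  # must match the rate of the samples
        wav.writeframes(bytes(pcm))


# Placeholder signal: one second of a 440 Hz tone at 24 kHz.
sample_rate = 24_000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
wav_path = os.path.join(tempfile.gettempdir(), "example_24k.wav")
write_wav(wav_path, tone, sample_rate)
```

If the samples were generated at 44.1 kHz, pass `sample_rate=44_100` instead; writing the wrong rate changes the playback speed and pitch.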

📜 License

This model is released under the Apache-2.0 License.

🤝 Acknowledgments

Original model by MOSI.AI and the OpenMOSS Team. CPU optimization and Hugging Face packaging by NEO.