---
license: apache-2.0
tags:
- text-to-speech
- audio
- cpu-optimized
language:
- zh
- en
- de
- es
- fr
- ja
- it
- he
- ko
- ru
- fa
- ar
- pl
- pt
- cs
- da
- sv
- hu
- el
- tr
library_name: transformers
---

# MOSS-TTS (CPU Optimized)

> **🚀 CPU Optimized Version**: This repository contains a build of **MOSS-TTS** optimized for execution in CPU-only environments.

**This optimization and packaging process was performed autonomously by [NEO](https://github.com/daksh-neo), an autonomous ML engineering agent.**

## Overview
This version of MOSS-TTS uses runtime dynamic quantization and CPU-oriented configuration to deliver low-latency speech synthesis without requiring a GPU. MOSS-TTS is a family of speech and sound generation models designed for high-fidelity, expressive synthesis in complex real-world scenarios.

### Key Optimizations by NEO:
- **Dynamic INT8 Quantization**: Reduces memory footprint and accelerates inference on modern CPUs.
- **Thread Scaling**: Configured for multi-threaded CPU inference.
- **CPU-Friendly Tensors**: All weights and buffers are prepared for FP32/INT8 execution paths.
- **Autonomous Validation**: Verified functionality in resource-constrained environments.
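
The first two optimizations can be illustrated with standard PyTorch calls. The sketch below is not NEO's actual pipeline, which this card does not document. It applies `quantize_dynamic` to a small stand-in network (`toy`, a hypothetical placeholder for a TTS sub-module), and the helper name `quantize_for_cpu` and the thread count of 2 are illustrative assumptions.

```python
import os
from typing import Optional

import torch
from torch import nn
from torch.ao.quantization import quantize_dynamic


def quantize_for_cpu(model: nn.Module, num_threads: Optional[int] = None) -> nn.Module:
    """Return a copy of `model` whose Linear layers use dynamic INT8 weights."""
    # Intra-op thread count; falling back to the visible core count is an assumption.
    torch.set_num_threads(num_threads or os.cpu_count() or 1)
    model.eval()
    # Weights are stored as INT8; activations are quantized on the fly at runtime.
    return quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)


# Hypothetical stand-in network, used only to keep the example self-contained.
toy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
quantized = quantize_for_cpu(toy, num_threads=2)

with torch.no_grad():
    out = quantized(torch.randn(1, 64))

print(type(quantized[0]).__module__)  # quantized Linear implementation
print(tuple(out.shape))
```

Dynamic quantization only replaces the module types passed in the set (here `nn.Linear`), so layers such as convolutions or embeddings stay in FP32 unless handled separately.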

---

## 🛠 Usage

### Installation
```bash
pip install transformers torch torchaudio
```

### Quick Start
```python
import torch
from transformers import AutoModel, AutoProcessor

# Load the CPU-optimized model. trust_remote_code=True is needed because the
# processor and model classes are defined by code shipped in the repository.
model_name = "daksh-neo/MOSS-TTS"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float32,  # keep FP32 weights for the CPU execution path
)

# Inference (example)
text = "This is CPU-optimized speech synthesis by NEO."
inputs = processor(text=[text], mode="generation")
with torch.no_grad():  # gradients are not needed for synthesis
    outputs = model.generate(**inputs)
```
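
To check the low-latency claim on your own hardware, one common measure is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value below 1 means faster than real time. The sketch below is a minimal, model-independent timer. `fake_synthesize` is a hypothetical placeholder that returns one second of silence, and you would swap in a function that wraps the model call and returns the waveform samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate_hz: int,
) -> float:
    """Time one synthesis call and divide by the duration of its output."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed_s = time.perf_counter() - start
    if len(waveform) == 0:
        raise ValueError("synthesize() returned no samples")
    audio_s = len(waveform) / sample_rate_hz
    return elapsed_s / audio_s


def fake_synthesize(text: str) -> Sequence[float]:
    # Hypothetical stand-in: one second of silence at 24 kHz.
    return [0.0] * 24_000


rtf = real_time_factor(fake_synthesize, "hello", sample_rate_hz=24_000)
print(f"RTF: {rtf:.6f}")
```

Averaging over several calls after a warm-up run gives a steadier number, since the first call often includes one-time initialization costs.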

---

## 📊 Capabilities
- **Zero-shot Voice Cloning**: Clone voices from short reference clips.
- **Multilingual Support**: High-quality synthesis across 20+ languages.
- **Long-form Stability**: Synthesize stable audio for durations up to 1 hour.
- **Fine-grained Control**: Phoneme-level and duration-level control for precise prosody.

## 🏗 Architecture
This export is based on the **MossTTSDelay** architecture, optimized for sequential stability and CPU throughput.

| Feature | Specification |
|---|---|
| Optimization Engine | NEO (Autonomous ML Agent) |
| Device Target | CPU (x86_64 / ARM64) |
| Quantization | Dynamic INT8 |
| Sampling Rate | 24 kHz / 44.1 kHz (configurable) |
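
The sampling rate matters for memory planning on CPU-only machines, especially near the one-hour upper bound listed under Capabilities. The arithmetic below is a rough sketch that assumes mono audio held in memory as 32-bit float samples. Neither assumption is confirmed by this card, and the helper name `pcm_buffer_stats` is illustrative.

```python
def pcm_buffer_stats(
    duration_s: float,
    sample_rate_hz: int,
    bytes_per_sample: int = 4,  # assumption: float32 samples
    channels: int = 1,          # assumption: mono output
) -> tuple:
    """Return (sample_count, size_in_bytes) for an uncompressed audio buffer."""
    samples = int(round(duration_s * sample_rate_hz)) * channels
    return samples, samples * bytes_per_sample


for rate in (24_000, 44_100):
    samples, size_bytes = pcm_buffer_stats(3600, rate)
    print(f"{rate} Hz: {samples:,} samples, {size_bytes / 1e6:.1f} MB")
# 24000 Hz: 86,400,000 samples, 345.6 MB
# 44100 Hz: 158,760,000 samples, 635.0 MB
```

Under these assumptions, an hour of output at 44.1 kHz needs roughly 1.8 times the buffer memory of the same hour at 24 kHz, before counting model weights and intermediate activations.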

## 📜 License
This model is released under the **Apache-2.0 License**.

## 🤝 Acknowledgments
Original model by [MOSI.AI](https://mosi.cn/) and the [OpenMOSS Team](https://github.com/OpenMOSS/MOSS-TTS).
CPU optimization and Hugging Face packaging by **NEO**.