---
license: apache-2.0
tags:
- text-to-speech
- audio
- cpu-optimized
language:
- zh
- en
- de
- es
- fr
- ja
- it
- he
- ko
- ru
- fa
- ar
- pl
- pt
- cs
- da
- sv
- hu
- el
- tr
library_name: transformers
---
# MOSS-TTS (CPU Optimized)
> **🚀 CPU Optimized Version**: This repository contains a build of **MOSS-TTS** optimized for high-performance execution in CPU-only environments.
**This optimization and packaging process was performed autonomously by [NEO](https://github.com/daksh-neo), an autonomous ML engineering agent.**
## Overview
This version of MOSS-TTS uses runtime dynamic quantization and specific architectural configurations to deliver low-latency speech synthesis without requiring a GPU. MOSS-TTS is a state-of-the-art speech and sound generation model family designed for high-fidelity, high-expressiveness, and complex real-world scenarios.
### Key Optimizations by NEO:
- **Dynamic INT8 Quantization**: Reduces memory footprint and accelerates inference on modern CPUs.
- **Thread Scaling**: Configured for optimal multi-threaded performance.
- **CPU-Friendly Tensors**: All weights and buffers are prepared for FP32/INT8 execution paths.
- **Autonomous Validation**: Verified functionality in resource-constrained environments.
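
To make the quantization item above more concrete, here is a minimal sketch of dynamic INT8 quantization in PyTorch. It is not the procedure used to produce this build, which the README does not document. The toy `model` is a hypothetical stand-in for a feed-forward block, and the `serialized_size` helper is introduced only for illustration. It assumes a PyTorch 2.x install in which `torch.ao.quantization.quantize_dynamic` is available.

```python
import io

import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Toy stand-in for a feed-forward block (NOT the MOSS-TTS model).
model = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256)).eval()

# Swap nn.Linear layers for dynamically quantized versions: weights are
# stored as INT8, activations are quantized on the fly at inference time.
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)


def serialized_size(module: nn.Module) -> int:
    """Return the size in bytes of the module's serialized state_dict."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


fp32_bytes = serialized_size(model)
int8_bytes = serialized_size(quantized)

# Run one forward pass to confirm the quantized module still works.
with torch.inference_mode():
    y = quantized(torch.randn(1, 256))

print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes, output: {tuple(y.shape)}")
```

Only the `nn.Linear` layers are converted here; other layer types stay in FP32, which is why the lists above mention both FP32 and INT8 execution paths.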
---
## 🛠 Usage
### Installation
```bash
pip install transformers torch torchaudio
```
### Quick Start
```python
from transformers import AutoModel, AutoProcessor
import torch
# Load the CPU-optimized model
model_name = "daksh-neo/MOSS-TTS"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float32,
)
# Inference (Example)
text = "This is a CPU-optimized speech synthesis by NEO."
inputs = processor(text=[text], mode="generation")
outputs = model.generate(**inputs)
```
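
On shared or containerized hosts, it can help to set the PyTorch thread count explicitly and to run generation without autograd tracking. The sketch below illustrates the pattern on a hypothetical stand-in module (`stand_in`), not on the MOSS-TTS model. The cap of 4 threads is an illustrative value, not a setting taken from this build.

```python
import os
import time

import torch
import torch.nn as nn

# Cap intra-op parallelism; 4 is an illustrative value, not a tuned setting.
num_threads = min(4, os.cpu_count() or 1)
torch.set_num_threads(num_threads)

# Hypothetical stand-in for the loaded model.
stand_in = nn.Linear(512, 512).eval()
batch = torch.randn(8, 512)

# inference_mode() disables autograd bookkeeping for everything inside it.
with torch.inference_mode():
    start = time.perf_counter()
    result = stand_in(batch)
    elapsed = time.perf_counter() - start

print(f"threads={torch.get_num_threads()} elapsed={elapsed * 1000:.2f} ms")
```

With the real model, the same pattern would wrap the `model.generate(**inputs)` call. The best thread count depends on the host and is worth measuring rather than assuming.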
---
## 📊 Capabilities
- **Zero-shot Voice Cloning**: Clone voices from short reference clips.
- **Multilingual Support**: High-quality synthesis across 20+ languages.
- **Long-form Stability**: Synthesize stable audio for durations up to 1 hour.
- **Fine-grained Control**: Phoneme-level and duration-level control for precise prosody.
## 🏗 Architecture
This specific export is based on the **MossTTSDelay** architecture, optimized for sequential stability and CPU throughput.
| Feature | Specification |
|---|---|
| Optimization Engine | NEO (Autonomous ML Agent) |
| Device Target | CPU (x86_64 / ARM64) |
| Quantization | Dynamic INT8 |
| Sampling Rate | 24 kHz / 44.1 kHz (Configurable) |
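
The sampling rate matters when writing audio to disk, because a file tagged with the wrong rate plays back at the wrong speed and pitch. The following is a minimal sketch using only the Python standard library. It assumes you already have a mono waveform as floats in [-1.0, 1.0]; how to get that waveform from the model's output is not covered here. The `write_wav` helper and the placeholder tone are hypothetical.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Hz; 44_100 is the other rate listed in the table


def write_wav(path: str, samples: list[float], sample_rate: int = SAMPLE_RATE) -> None:
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM WAV."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overflow
        pcm += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)  # must match the synthesis rate
        wav_file.writeframes(bytes(pcm))


# Placeholder signal standing in for synthesized speech: one second of 440 Hz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
write_wav("example.wav", tone)
```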
## 📜 License
This model is released under the **Apache-2.0 License**.
## 🀝 Acknowledgments
Original model by [MOSI.AI](https://mosi.cn/) and the [OpenMOSS Team](https://github.com/OpenMOSS/MOSS-TTS).
CPU Optimization and Hugging Face packaging by **NEO**.