---
license: apache-2.0
tags:
- text-to-speech
- audio
- cpu-optimized
language:
- zh
- en
- de
- es
- fr
- ja
- it
- he
- ko
- ru
- fa
- ar
- pl
- pt
- cs
- da
- sv
- hu
- el
- tr
library_name: transformers
---

# MOSS-TTS (CPU Optimized)

> **🚀 CPU Optimized Version**: This repository contains a build of **MOSS-TTS** optimized for execution in CPU-only environments. **This optimization and packaging process was performed autonomously by [NEO](https://github.com/daksh-neo), an autonomous ML engineering agent.**

## Overview

This version of MOSS-TTS uses runtime dynamic quantization and CPU-specific configuration to deliver low-latency speech synthesis without requiring a GPU. MOSS-TTS is a state-of-the-art speech and sound generation model family designed for high-fidelity, high-expressiveness synthesis in complex real-world scenarios.

### Key Optimizations by NEO:

- **Dynamic INT8 Quantization**: Reduces memory footprint and accelerates inference on modern CPUs.
- **Thread Scaling**: Configured for multi-threaded CPU performance.
- **CPU-Friendly Tensors**: All weights and buffers are prepared for FP32/INT8 execution paths.
- **Autonomous Validation**: Functionality verified in resource-constrained environments.

---

## 🛠 Usage

### Installation

```bash
pip install transformers torch torchaudio
```

### Quick Start

```python
import torch
from transformers import AutoModel, AutoProcessor

# Load the CPU-optimized model.
# trust_remote_code=True runs the custom model code shipped in the repository.
model_name = "daksh-neo/MOSS-TTS"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

# Inference (example)
text = "This is a CPU-optimized speech synthesis by NEO."
inputs = processor(text=[text], mode="generation")
outputs = model.generate(**inputs)
```

---

## 📊 Capabilities

- **Zero-shot Voice Cloning**: Clone voices from short reference clips.
- **Multilingual Support**: High-quality synthesis across 20+ languages.
- **Long-form Stability**: Synthesize stable audio for durations up to 1 hour.
- **Fine-grained Control**: Phoneme-level and duration-level control for precise prosody.

## 🏗 Architecture

This specific export is based on the **MossTTSDelay** architecture, optimized for sequential stability and CPU throughput.

| Feature | Specification |
|---|---|
| Optimization Engine | NEO (Autonomous ML Agent) |
| Device Target | CPU (x86_64 / ARM64) |
| Quantization | Dynamic INT8 |
| Sampling Rate | 24 kHz / 44.1 kHz (configurable) |

## 📜 License

This model is released under the **Apache-2.0 License**.

## 🤝 Acknowledgments

Original model by [MOSI.AI](https://mosi.cn/) and the [OpenMOSS Team](https://github.com/OpenMOSS/MOSS-TTS). CPU Optimization and Hugging Face packaging by **NEO**.
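
## 🧪 Sketch: Dynamic INT8 Quantization and Thread Scaling

The dynamic INT8 quantization and thread scaling listed under Key Optimizations can be illustrated with standard PyTorch calls. The snippet below is a minimal sketch on a toy network, not the pipeline used to build this repository: the layer sizes, the variable names, and the choice to quantize only `nn.Linear` layers are assumptions made for illustration.

```python
import os

import torch
import torch.nn as nn

# Thread scaling: use all available cores (an assumption; tune for your host).
torch.set_num_threads(max(1, os.cpu_count() or 1))

# Toy stand-in for a model; NOT the MOSS-TTS architecture.
fp32_model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 32),
).eval()

# Dynamic quantization: Linear weights are stored as INT8, while activations
# are quantized on the fly at inference time.
int8_model = torch.ao.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 64)
with torch.inference_mode():
    ref = fp32_model(x)
    out = int8_model(x)

# Quantization is lossy, so the outputs are close but not identical.
max_abs_err = (ref - out).abs().max().item()
print(f"threads={torch.get_num_threads()} max_abs_err={max_abs_err:.4f}")
```

Quantized kernels depend on the backend available on the host (for example, a different engine may be selected on ARM64 than on x86_64), so it is worth comparing FP32 and INT8 outputs on your own hardware before relying on the quantized path.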