---
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-speech
base_model:
- OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5
tags:
- text-to-speech
- voice-cloning
- custom_code
- sglang-omni
- moss-tts
- moss-tts-local
- lora
- saudi-arabic
language:
- ar
---

# Moss-Saudi

This repository contains a Saudi Arabic LoRA fine-tune of `OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5`.

Artifacts:

- Root files: merged full model weights for direct `from_pretrained` and SGLang-Omni serving.
- `lora_adapter/`: the original PEFT LoRA adapter, with portable Hub metadata.
- `training_summary.json`: sanitized training and checkpoint metadata.

The model uses `OpenMOSS-Team/MOSS-Audio-Tokenizer-v2` for 48 kHz stereo audio decoding.

## SGLang-Omni

SGLang-Omni supports `MossTTSLocalModel` through the OpenAI-compatible `/v1/audio/speech` endpoint.

```bash
sgl-omni serve \
  --model-path Rabe3/Moss-Saudi \
  --allowed-media-domain huggingface.co \
  --allowed-media-domain cas-bridge.xethub.hf.co \
  --port 8000
```

Then request speech:

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Marhaba, this is a short Saudi Arabic TTS test."}' \
  --output moss_saudi.wav
```

The included `serve_sglang_omni.sh` wrapper runs the same server command:

```bash
bash serve_sglang_omni.sh
```

## Transformers

```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

model_id = "Rabe3/Moss-Saudi"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=dtype,
    attn_implementation="sdpa" if device == "cuda" else "eager",
).to(device)
model.eval()

conversation = [[processor.build_user_message(
    text="Marhaba, this is a short Saudi Arabic TTS test.",
    language="Arabic",
)]]
batch = processor(conversation, mode="generation")

with torch.inference_mode():
    outputs = model.generate(
        input_ids=batch["input_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device),
        max_new_tokens=4096,
        do_sample=True,
        audio_temperature=1.7,
        audio_top_p=0.8,
        audio_top_k=25,
    )

message = processor.decode(outputs)[0]
audio = message.audio_codes_list[0].detach().cpu().to(torch.float32)
torchaudio.save("moss_saudi.wav", audio, processor.model_config.sampling_rate)
```

## LoRA Adapter

The adapter remains available if you want to apply it manually:

```python
import torch
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained(
    "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5",
    trust_remote_code=True,
    dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "Rabe3/Moss-Saudi", subfolder="lora_adapter")
```
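
If you apply the adapter manually and want standalone weights afterwards, one possible approach is to fold the LoRA deltas into the base model with PEFT's `merge_and_unload()` and save the result. The sketch below is an illustration, not necessarily how the merged root files in this repository were produced. The helper name `merge_lora_adapter` and the output directory `moss_saudi_merged` are hypothetical, and the sketch assumes the custom model code loads cleanly under PEFT. It saves only the model; the processor and any custom code files are not handled here.

```python
from pathlib import Path


def merge_lora_adapter(
    base_id: str = "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5",
    adapter_repo: str = "Rabe3/Moss-Saudi",
    adapter_subfolder: str = "lora_adapter",
    output_dir: str = "moss_saudi_merged",  # hypothetical local path
) -> Path:
    """Load the base model, apply the LoRA adapter, and save merged weights."""
    # Heavy dependencies are imported lazily so that importing this module
    # stays cheap and nothing is downloaded until the function is called.
    import torch
    from peft import PeftModel
    from transformers import AutoModel

    base = AutoModel.from_pretrained(
        base_id,
        trust_remote_code=True,
        dtype=torch.bfloat16,
    )
    model = PeftModel.from_pretrained(base, adapter_repo, subfolder=adapter_subfolder)

    # merge_and_unload() folds the LoRA deltas into the base weights and
    # returns the underlying model without the PEFT wrapper layers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)
    return Path(output_dir)
```

Calling `merge_lora_adapter()` with its defaults downloads the base model and adapter, then writes the merged weights to the local `moss_saudi_merged` directory. Pass a different `output_dir` to change the destination.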