---
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-speech
base_model:
- OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5
tags:
- text-to-speech
- voice-cloning
- custom_code
- sglang-omni
- moss-tts
- moss-tts-local
- lora
- saudi-arabic
language:
- ar
---
# Moss-Saudi
This repository contains a Saudi Arabic LoRA fine-tune of
`OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5`.

Artifacts:

- Root files: merged full model weights for direct `from_pretrained` and SGLang-Omni serving.
- `lora_adapter/`: the original PEFT LoRA adapter, with portable Hub metadata.
- `training_summary.json`: sanitized training and checkpoint metadata.

The model uses `OpenMOSS-Team/MOSS-Audio-Tokenizer-v2` for 48 kHz stereo audio decoding.
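
To look at the training metadata listed above without cloning the whole repository, a single file can be fetched from the Hub. The sketch below uses `hf_hub_download` from `huggingface_hub`. The helper names `load_training_summary` and `top_level_keys` are hypothetical, and the code assumes `training_summary.json` holds a JSON object, since its schema is not described here.

```python
import json


def load_training_summary(repo_id: str = "Rabe3/Moss-Saudi") -> dict:
    """Download training_summary.json from the Hub and parse it."""
    # Deferred import: huggingface_hub is only needed when this function runs.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=repo_id, filename="training_summary.json")
    with open(path, encoding="utf-8") as f:
        # Assumption: the file contains a JSON object (a dict), not a list.
        return json.load(f)


def top_level_keys(summary: dict) -> list:
    """Return the sorted top-level keys of the parsed summary."""
    return sorted(summary)
```

Calling `top_level_keys(load_training_summary())` lists the fields that are available before you rely on any of them.
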
## SGLang-Omni
SGLang-Omni supports `MossTTSLocalModel` through the OpenAI-compatible
`/v1/audio/speech` endpoint.
```bash
sgl-omni serve \
  --model-path Rabe3/Moss-Saudi \
  --allowed-media-domain huggingface.co \
  --allowed-media-domain cas-bridge.xethub.hf.co \
  --port 8000
```
Then request speech:
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Marhaba, this is a short Saudi Arabic TTS test."}' \
  --output moss_saudi.wav
```
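
The same request can be sent from Python. The following is a minimal sketch using only the standard library. It mirrors the curl call above (same URL, same JSON body, same output file name) and assumes the server is reachable on `localhost:8000`. The helper names `build_speech_request` and `synthesize`, and the 300-second timeout, are illustrative choices rather than part of SGLang-Omni.

```python
import json
import urllib.request

# Assumed default, copied from the curl example.
SPEECH_URL = "http://localhost:8000/v1/audio/speech"


def build_speech_request(text: str, url: str = SPEECH_URL) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "moss_saudi.wav") -> str:
    """Send the request and write the raw response bytes to out_path."""
    request = build_speech_request(text)
    # The timeout is an arbitrary illustrative value; generation can be slow.
    with urllib.request.urlopen(request, timeout=300) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return out_path
```

With the server running, `synthesize("Marhaba, this is a short Saudi Arabic TTS test.")` writes the response to `moss_saudi.wav` and returns that path.
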
The included `serve_sglang_omni.sh` wrapper runs the same server command:
```bash
bash serve_sglang_omni.sh
```
## Transformers
```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

model_id = "Rabe3/Moss-Saudi"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Use bfloat16 on GPU and fall back to float32 on CPU.
dtype = torch.bfloat16 if device == "cuda" else torch.float32

# The processor relies on custom code from the repository.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=dtype,
    attn_implementation="sdpa" if device == "cuda" else "eager",
).to(device)
model.eval()

# A batch with one conversation that contains a single user message.
conversation = [[processor.build_user_message(
    text="Marhaba, this is a short Saudi Arabic TTS test.",
    language="Arabic",
)]]
batch = processor(conversation, mode="generation")

with torch.inference_mode():
    outputs = model.generate(
        input_ids=batch["input_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device),
        max_new_tokens=4096,
        do_sample=True,
        audio_temperature=1.7,
        audio_top_p=0.8,
        audio_top_k=25,
    )

# Decode the generated output and save the first audio clip as a WAV file.
message = processor.decode(outputs)[0]
audio = message.audio_codes_list[0].detach().cpu().to(torch.float32)
torchaudio.save("moss_saudi.wav", audio, processor.model_config.sampling_rate)
```
## LoRA Adapter
The adapter remains available if you want to apply it manually:
```python
import torch
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained(
    "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5",
    trust_remote_code=True,
    dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "Rabe3/Moss-Saudi", subfolder="lora_adapter")
```
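
If you want a standalone checkpoint after applying the adapter, PEFT's `merge_and_unload()` folds the LoRA weights into the base model. The sketch below wraps the loading code above in a hypothetical helper, `merge_lora_adapter`. The output directory name is an assumption, and nothing here confirms that the result matches the merged root weights in this repository bit for bit.

```python
def merge_lora_adapter(output_dir: str = "moss_saudi_merged"):
    """Load the base model, apply the adapter, and fold it into the base weights."""
    # Deferred imports so the function can be defined without the GPU stack loaded.
    import torch
    from peft import PeftModel
    from transformers import AutoModel

    base = AutoModel.from_pretrained(
        "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5",
        trust_remote_code=True,
        dtype=torch.bfloat16,
    )
    model = PeftModel.from_pretrained(base, "Rabe3/Moss-Saudi", subfolder="lora_adapter")

    # merge_and_unload() returns the underlying model with the LoRA deltas applied.
    merged = model.merge_and_unload()
    # output_dir is an illustrative local path, not a file in this repository.
    merged.save_pretrained(output_dir)
    return merged
```

Merging removes the adapter indirection at inference time, at the cost of storing a full copy of the weights.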