---
license: apache-2.0
library_name: transformers
tags:
- audio
- audio-tokenizer
- neural-codec
- moss-tts-family
- CAT
- speech-tokenizer
- trust-remote-code
---
# MossAudioTokenizer
MossAudioTokenizer is a neural audio codec model for audio tokenization and synthesis.

This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository and loaded with `trust_remote_code=True` when needed.
## Usage

### Quickstart
```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

audio = torch.randn(1, 1, 3200)  # dummy waveform
enc = model.encode(audio, return_dict=True)
dec = model.decode(enc.audio_codes, return_dict=True)
```
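Real recordings must match the model's sampling rate (24 kHz in the streaming example below) before encoding. A minimal sketch of naive linear resampling with plain PyTorch — in practice you would read the rate from the model config and prefer a proper resampler such as `torchaudio.functional.resample`:

```python
import torch
import torch.nn.functional as F

def resample_linear(waveform: torch.Tensor, orig_sr: int, target_sr: int) -> torch.Tensor:
    """Naive linear resampling for a (batch, channels, time) waveform.

    Illustrative only; a bandlimited resampler (e.g. torchaudio) gives
    better audio quality.
    """
    new_len = int(waveform.shape[-1] * target_sr / orig_sr)
    return F.interpolate(waveform, size=new_len, mode="linear", align_corners=False)

audio_16k = torch.randn(1, 1, 2133)                 # e.g. a 16 kHz recording
audio_24k = resample_linear(audio_16k, 16000, 24000)
print(audio_24k.shape)                              # torch.Size([1, 1, 3199])
```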
### Streaming
`MossAudioTokenizerModel.encode` and `MossAudioTokenizerModel.decode` support simple streaming via a `chunk_duration` argument.

- `chunk_duration` is expressed in seconds.
- It must be <= `MossAudioTokenizerConfig.causal_transformer_context_duration`.
- `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- Streaming chunking only supports `batch_size=1`.
```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

audio = torch.randn(1, 1, 3200)  # dummy waveform

# 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
```
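The streaming constraints can be checked up front before calling `encode`. A minimal sketch using the values from this card's example (24 kHz, `downsample_rate=1920`); the context duration here is a placeholder — read the real values from `model.config`:

```python
# Assumed values for illustration; in real use take them from model.config.
SAMPLING_RATE = 24000          # matches the 24 kHz example above
DOWNSAMPLE_RATE = 1920         # matches downsample_rate=1920 above
CONTEXT_DURATION = 0.08        # placeholder for causal_transformer_context_duration

def is_valid_chunk_duration(chunk_duration: float) -> bool:
    """Return True if chunk_duration satisfies the streaming constraints."""
    samples = chunk_duration * SAMPLING_RATE
    return (
        chunk_duration <= CONTEXT_DURATION          # within transformer context
        and samples == int(samples)                 # whole number of samples
        and int(samples) % DOWNSAMPLE_RATE == 0     # divisible by downsample rate
    )

print(is_valid_chunk_duration(0.08))  # True: 1920 samples, divisible by 1920
print(is_valid_chunk_duration(0.05))  # False: 1200 samples, not divisible
```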
## Repository layout

- `configuration_moss_audio_tokenizer.py`
- `modeling_moss_audio_tokenizer.py`
- `__init__.py`
- `config.json`
- model weights