---
license: apache-2.0
library_name: transformers
tags:
  - audio
  - audio-tokenizer
  - neural-codec
  - moss-tts-family
  - CAT
  - speech-tokenizer
  - trust-remote-code
---

# MossAudioTokenizer

MossAudioTokenizer is a neural audio codec model for audio tokenization and synthesis.

This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository and loaded with `trust_remote_code=True` when needed.

## Usage

### Quickstart

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

audio = torch.randn(1, 1, 3200)  # dummy waveform
enc = model.encode(audio, return_dict=True)
dec = model.decode(enc.audio_codes, return_dict=True)
```

### Streaming

`MossAudioTokenizerModel.encode` and `MossAudioTokenizerModel.decode` support simple streaming via a `chunk_duration` argument.

- `chunk_duration` is expressed in seconds.
- It must be at most `MossAudioTokenizerConfig.causal_transformer_context_duration`.
- `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- Streaming chunking only supports `batch_size=1`.

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
audio = torch.randn(1, 1, 3200)  # dummy waveform

# 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
```
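
The divisibility and context-window constraints above can be sketched as a small validity check. This is a minimal sketch, not part of the library API: the default values `sampling_rate=24000` and `downsample_rate=1920` are assumptions taken from the comment in the example, and in practice they should be read from `model.config`.

```python
# Sketch: check whether a chunk_duration satisfies the streaming
# constraints described above. The default sampling_rate and
# downsample_rate are assumptions from the example comment; read the
# real values from model.config in practice.
def is_valid_chunk_duration(chunk_duration, sampling_rate=24000,
                            downsample_rate=1920, context_duration=None):
    samples = chunk_duration * sampling_rate
    n = round(samples)
    # chunk_duration * sampling_rate must be a whole number of samples...
    if abs(samples - n) > 1e-6:
        return False
    # ...and that sample count must be divisible by downsample_rate
    if n % downsample_rate != 0:
        return False
    # chunk_duration must not exceed the causal transformer context window
    if context_duration is not None and chunk_duration > context_duration:
        return False
    return True

print(is_valid_chunk_duration(0.08))  # 1920 samples -> True
print(is_valid_chunk_duration(0.05))  # 1200 samples -> False
```

With these assumed values, valid chunk durations are multiples of `downsample_rate / sampling_rate` = 0.08 s.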

## Repository layout

- `configuration_moss_audio_tokenizer.py`
- `modeling_moss_audio_tokenizer.py`
- `__init__.py`
- `config.json`
- model weights