mlx-community/mimi-encoder-mlx

The encoder half of Kyutai's Mimi neural audio codec, converted to MLX format for native inference on Apple Silicon. It is consumed by the xocialize/mimi-encoder-mlx-swift Swift port. Refer to the original model card (kyutai/mimi) for full details.

Model

Family: Mimi neural audio codec (Kyutai / Moshi — Défossez et al., arXiv:2410.00037)
This artifact: the encoder only (SEANet conv encoder → causal transformer → stride-2 downsample → split RVQ)
Input: 24000 Hz, mono
Output: [16, T] codebook-index grid at 12.5 Hz (1 semantic + 15 acoustic codebooks)
Precision: fp32 (145 tensors)
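
The input and output rates above imply a fixed number of audio samples per code frame: 24000 / 12.5 = 1920. The sketch below works through that arithmetic in plain Swift. The helper name `estimatedFrameCount` and the round-up rule for a trailing partial frame are assumptions for illustration; the actual encoder may pad or truncate differently.

```swift
import Foundation

// Timing constants taken from the model card.
let sampleRate = 24_000.0   // Hz, mono input
let frameRate = 12.5        // Hz, output code frames

// Samples consumed per output frame: 24000 / 12.5 = 1920.
let samplesPerFrame = Int(sampleRate / frameRate)

/// Estimates T (the number of code frames) for a clip of `sampleCount` samples.
/// Assumption: a trailing partial frame is rounded up to a full frame.
func estimatedFrameCount(sampleCount: Int) -> Int {
    (sampleCount + samplesPerFrame - 1) / samplesPerFrame
}

/// Duration in seconds covered by `frames` code frames.
func duration(ofFrames frames: Int) -> Double {
    Double(frames) / frameRate
}

// Ten seconds of audio is exactly 125 frames.
let tenSecondFrames = estimatedFrameCount(sampleCount: 240_000)
print(samplesPerFrame, tenSecondFrames, duration(ofFrames: tenSecondFrames))
```

Under these assumptions a 10 s clip yields a [16, 125] grid, and each frame spans 80 ms of audio.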

Files

encoder.safetensors — the MLX encoder weights (fp32), extracted/converted from kyutai/mimi.
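
One way to sanity-check a downloaded weights file is to read its safetensors header and count the tensor entries, which should match the 145 tensors listed above. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function name `safetensorsTensorNames` and the error type are hypothetical and are not part of the Swift port.

```swift
import Foundation

enum SafetensorsHeaderError: Error {
    case truncated
    case malformedHeader
}

/// Returns the tensor names listed in a safetensors header.
/// Layout: 8-byte little-endian header length, then that many bytes of JSON.
func safetensorsTensorNames(in data: Data) throws -> [String] {
    guard data.count >= 8 else { throw SafetensorsHeaderError.truncated }

    // Assemble the little-endian UInt64 byte by byte to avoid alignment issues.
    var headerLength: UInt64 = 0
    for (i, byte) in data.prefix(8).enumerated() {
        headerLength |= UInt64(byte) << UInt64(8 * i)
    }
    guard headerLength <= UInt64(data.count - 8) else {
        throw SafetensorsHeaderError.truncated
    }

    let start = data.startIndex + 8
    let headerData = data.subdata(in: start ..< start + Int(headerLength))
    let json = try JSONSerialization.jsonObject(with: headerData)
    guard let header = json as? [String: Any] else {
        throw SafetensorsHeaderError.malformedHeader
    }

    // "__metadata__" is an optional free-form entry, not a tensor.
    return header.keys.filter { $0 != "__metadata__" }.sorted()
}
```

For example, `try safetensorsTensorNames(in: Data(contentsOf: encoderWeightsURL, options: .mappedIfSafe)).count` would report the tensor count without loading the weights into MLX.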

Usage (Swift / MLX)

import MimiCodecEncoder

let encoder = MimiEncoder(config: .qwen3TTS12Hz)

// Load the fp32 weights from encoder.safetensors.
try encoder.loadWeights(from: encoderWeightsURL)

// audioArray: 24 kHz mono audio. codes: [16, T] codebook indices at 12.5 Hz.
let codes = encoder.encode(audio: audioArray)
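
Downstream code often needs the semantic codebook separately from the 15 acoustic ones. The sketch below shows one way to split a [16, T] grid after it has been copied into plain Swift arrays. `MimiCodeGrid` is a hypothetical helper, not a type from the Swift port, and it assumes the semantic codebook is row 0, which the model card does not state explicitly.

```swift
/// A [16, T] Mimi code grid split into semantic and acoustic rows.
/// Assumption: row 0 is the semantic codebook, rows 1...15 are acoustic.
struct MimiCodeGrid {
    static let codebookCount = 16

    let semantic: [Int32]
    let acoustic: [[Int32]]

    /// Fails unless there are exactly 16 rows of equal length.
    init?(rows: [[Int32]]) {
        guard rows.count == Self.codebookCount,
              let first = rows.first,
              rows.allSatisfy({ $0.count == first.count }) else {
            return nil
        }
        semantic = first
        acoustic = Array(rows.dropFirst())
    }

    /// Number of frames T.
    var frameCount: Int { semantic.count }
}

// Usage with a dummy 16 x 3 grid where every row holds its own index.
let dummyRows = (0..<16).map { row in [Int32](repeating: Int32(row), count: 3) }
if let grid = MimiCodeGrid(rows: dummyRows) {
    print(grid.frameCount, grid.acoustic.count)
}
```

Validating the shape in a failable initializer keeps a malformed grid from propagating into later stages.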

Source

Original model: https://huggingface.co/kyutai/mimi
Swift consumer: https://github.com/xocialize/mimi-encoder-mlx-swift

License

CC-BY-4.0 (Kyutai) — permissive, attribution required. This is a derivative (encoder-only, format-converted) of kyutai/mimi; attribution to Kyutai is retained.

Paper

Moshi: a speech-text foundation model for real-time dialogue (arXiv:2410.00037, published Sep 17, 2024)