---
license: apache-2.0
base_model: OpenMOSS-Team/MOSS-Audio-8B-Instruct
tags:
- mlx
- audio
- moss-audio
- asr
- int4
- apple-silicon
language:
- en
- zh
pipeline_tag: audio-text-to-text
library_name: mlx
---

# MOSS-Audio-8B-Instruct-MLX (hybrid: INT4 LLM + BF16 audio)

An Apple MLX conversion of **MOSS-Audio-8B-Instruct**, the strongest MOSS-Audio
checkpoint for ASR, for fast, low-memory inference on Apple Silicon. The LLM is quantized
to uniform INT4 (group_size 64); the audio encoder, adapter, and DeepStack are kept in BF16.

> **Why this exists.** The community had MLX builds only of the *Thinking* variant.
> But Thinking is not ASR-optimized: under identical INT4 quantization it misspells
> tickers that are spoken letter by letter (e.g. "CRWD" → "CWD") and is unstable.
> **Instruct** transcribes them correctly. This build brings Instruct's transcription
> quality to MLX's speed and memory footprint.

## Measured (Apple M1 Max 32 GB, 28 s zh+en clip)

| Metric | PyTorch Instruct | **This (Instruct-MLX)** | Thinking-MLX |
|---|:---:|:---:|:---:|
| Ticker "CRWD" | C R W D ✅ | **C R W D ✅** | CWD ❌ |
| English term (TradingView) | ✅ | ✅ | ✅ (loops) |
| Numerals | Chinese chars | **Arabic 47%** | Arabic |
| Speed | 1.8x realtime | **6–9x realtime** | 5–8x realtime |
| Peak memory | ~17 GB | **7.85 GB** | 7.85 GB |
| Disk | 18 GB | **5.9 GB** | 5.9 GB |

**Key finding.** Ticker-ASR degradation in the Thinking-MLX builds comes from the
Thinking/Instruct *training difference*, not from INT4 quantization: under the same
uniform INT4, Instruct still transcribes the ticker correctly. Uniform 4-bit therefore
suffices, and no mixed precision is needed.

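To make "uniform INT4 (group_size 64)" concrete, here is a minimal pure-Python sketch of group-wise 4-bit affine quantization: every group of 64 weights shares one scale and one offset, and each weight is stored as an integer code in 0..15. It only illustrates the general idea. The function names are hypothetical, the weights are synthetic, and the exact rounding and bit packing used by MLX may differ.

```python
def quantize_group(values, bits=4):
    """Quantize one group of floats to integer codes with a shared scale and offset."""
    levels = (1 << bits) - 1                 # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float weights."""
    return [c * scale + offset for c in codes]


def quantize(weights, group_size=64, bits=4):
    """Quantize a flat weight list group by group."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of group_size")
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


# Synthetic weights in [-1, 1]; real checkpoints hold billions of values.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(128)]
groups = quantize(weights)
restored = [v for codes, scale, offset in groups
            for v in dequantize_group(codes, scale, offset)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.4f}")
```

The reconstruction error of each weight is bounded by half of its group's scale, which is why a smaller group size trades extra scale/offset storage for lower error.
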
## Usage

```bash
pip install mlx mlx-lm soundfile numpy
python inference.py --audio your_clip_16k_mono.wav
```

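The example file name implies 16 kHz mono input. A quick pre-flight check can catch a wrong sample rate or a stereo file before running inference. The sketch below uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file. `check_wav` is a hypothetical helper, not part of this repository.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, description) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    ok = rate == expected_rate and channels == expected_channels
    return ok, f"{rate} Hz, {channels} channel(s), {seconds:.1f} s"
```

Calling `check_wav("your_clip_16k_mono.wav")` returns the pass/fail flag together with a short description that also shows the clip duration.
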
Transcription with per-segment timestamps (a Traditional Chinese prompt triggers
zh-Hant output):

```bash
# Prompt: "Transcribe this audio sentence by sentence and mark each sentence's start time."
python run_moss.py --model . --audio clip.wav \
  --prompt "請逐句轉錄這段音訊,每句標註開始時間。" --temp 0 --repetition-penalty 1.02
```

- **Audio**: 16 kHz mono. The encoder window is Whisper-style, **30 s max**; chunk longer audio.
- **Decoding**: use **greedy (temp=0)** for ASR fidelity. `temp>0` removes the rare
  tail digit-loop but degrades content (wrong numerals, out-of-order timestamps).
- **Digit-loop**: occasionally the model fails to emit EOS and repeats a digit token
  at the very tail; truncate the repeated trailing digits in post-processing. Quantization
  weakens EOS emission; the loop is a known, harmless tail artifact for transcription use.

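The post-truncation step for the tail digit-loop can be sketched with a regular expression. This is one possible approach, not the author's script: `strip_trailing_digit_loop` and its `min_repeats` threshold are hypothetical, and the threshold is an assumption chosen so that ordinary trailing numbers such as "1000" survive.

```python
import re


def strip_trailing_digit_loop(text, min_repeats=6):
    """Drop a run of one digit repeated at least `min_repeats` times at the end of `text`.

    Repeats may be separated by whitespace. Trailing whitespace is always stripped.
    """
    # (\d) captures the looping digit; \1 requires the same digit again.
    pattern = re.compile(r"(\d)(?:\s*\1){%d,}\s*$" % (min_repeats - 1))
    return pattern.sub("", text).rstrip()


print(strip_trailing_digit_loop("收盤價 47 元。3333333333"))
print(strip_trailing_digit_loop("total 1000"))
```

A transcript that legitimately ends in a long run of one digit (for example "1000000") would also be cut, so the threshold should be tuned against real output.
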
## How it was converted

Pure metadata-mapped weight conversion (no retraining):

1. `stage1_mapping.py`: verify that every MLX target key can be sourced from the PyTorch
   checkpoint, and discover the conv layout transform `transpose(0,2,3,1)`
   (PyTorch `[out,in,h,w]` → MLX `[out,h,w,in]`).
2. `stage2_convert.py`: extract `language_model.*` + `lm_head` and quantize them to INT4
   (group_size 64) via mlx; extract the audio encoder/adapter/DeepStack, apply the conv
   transpose, and save in BF16. The output mirrors the RumiLabs bridge layout exactly.

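The conv layout transform from step 1 can be illustrated with NumPy on a dummy weight. This is a sketch of the axis permutation only: the tensor shape and variable names are made up for the example, and the actual conversion scripts may load and store tensors differently.

```python
import numpy as np

# Hypothetical Conv2d weight in PyTorch layout [out, in, h, w].
w_torch = np.arange(4 * 2 * 3 * 3, dtype=np.float32).reshape(4, 2, 3, 3)

# Move the input-channel axis last: [out, in, h, w] -> [out, h, w, in].
w_mlx = np.transpose(w_torch, (0, 2, 3, 1))

# The same weight element is now addressed with reordered indices.
o, i, y, x = 3, 1, 2, 0
same_element = bool(w_mlx[o, y, x, i] == w_torch[o, i, y, x])
print(w_torch.shape, "->", w_mlx.shape, "| element preserved:", same_element)
```

Only the index order changes; no weight values are modified, which is why the conversion needs no retraining.
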
## Limitations

- 30-second audio window (chunk longer input and offset the timestamps).
- Tail digit-loop under greedy decoding (truncate in post-processing).
- Homophone errors on domain terms (e.g. 300均 → 三百軍); fix with a glossary or a post-processing pass.

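The chunk-and-offset workaround for the 30-second window might look like the sketch below. It assumes each chunk's transcript has already been parsed into `(start, end, text)` tuples with times relative to the chunk start; that tuple format and both helper names are hypothetical, since the model's raw timestamp format is not specified here.

```python
def plan_chunks(total_seconds, window=30.0):
    """Split [0, total_seconds) into consecutive windows of at most `window` seconds."""
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def offset_segments(segments, chunk_start):
    """Shift chunk-relative (start, end, text) segments onto the full-recording timeline."""
    return [(s + chunk_start, e + chunk_start, text) for s, e, text in segments]


print(plan_chunks(70.0))                             # [(0.0, 30.0), (30.0, 60.0), (60.0, 70.0)]
print(offset_segments([(1.5, 4.0, "hello")], 30.0))  # [(31.5, 34.0, 'hello')]
```

Fixed boundaries can cut a word in half, so cutting at silences or overlapping the windows slightly may give cleaner results.
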
## Credits

- Base model: [OpenMOSS-Team/MOSS-Audio](https://github.com/OpenMOSS/MOSS-Audio) (Apache-2.0)
- MLX bridge (encoder/mel/DeepStack port): [RumiLabs](https://huggingface.co/RumiLabs) Thinking-MLX builds
- Instruct→MLX conversion: this work

## License

Apache-2.0 (inherited from base model).