license: apache-2.0
base_model: OpenMOSS-Team/MOSS-Audio-8B-Instruct
tags:
  - mlx
  - audio
  - moss-audio
  - asr
  - int4
  - apple-silicon
language:
  - en
  - zh
pipeline_tag: audio-text-to-text
library_name: mlx

MOSS-Audio-8B-Instruct-MLX (hybrid: INT4 LLM + BF16 audio)

An Apple MLX conversion of MOSS-Audio-8B-Instruct, the MOSS-Audio checkpoint with the strongest ASR performance, for fast, low-memory inference on Apple Silicon. The LLM is quantized to uniform INT4 (group_size 64); the audio encoder, adapter, and DeepStack are kept in BF16.

Why this exists. The community had MLX builds of the Thinking variant only. Thinking is not ASR-optimized, however: under identical INT4 quantization it misspells tickers that are spoken letter by letter (e.g. "CRWD" → "CWD") and its output is unstable. Instruct transcribes them correctly. This build brings Instruct's transcription quality to MLX speed and memory use.

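Chinese summary (translated): This is an Apple MLX conversion of MOSS-Audio-8B-Instruct (uniform INT4 LLM, BF16 audio path). The community previously had MLX builds of the Thinking variant only, but Thinking is not ASR-optimized: under the same INT4 quantization it recognizes a ticker spoken letter by letter (such as "CRWD") as "CWD" and is unstable. Instruct recognizes it correctly. This build brings Instruct's transcription quality to MLX speed and memory use.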

Measured (Apple M1 Max 32 GB, 28 s zh+en clip)

| Metric | PyTorch Instruct | This (Instruct-MLX) | Thinking-MLX |
|---|---|---|---|
| Ticker "CRWD" | C R W D ✅ | C R W D ✅ | CWD ❌ |
| English term (TradingView) | ✅ | ✅ | ✅ (loops) |
| Numerals | Chinese chars | Arabic 47% | Arabic |
| Speed | 1.8x realtime | 6–9x | 5–8x |
| Peak memory | ~17 GB | 7.85 GB | 7.85 GB |
| Disk | 18 GB | 5.9 GB | 5.9 GB |

Key finding. Ticker-ASR degradation in the Thinking-MLX builds comes from the Thinking/Instruct training difference, not from INT4 quantization: under the same uniform INT4, Instruct keeps the ticker intact. Uniform 4-bit quantization therefore suffices here, and mixed precision is not needed.

Usage

pip install mlx mlx-lm soundfile numpy
python inference.py --audio your_clip_16k_mono.wav

Transcription with per-segment timestamps (a Traditional-Chinese prompt triggers zh-Hant output):

python run_moss.py --model . --audio clip.wav \
  --prompt "請逐句轉錄這段音訊，每句標註開始時間。" --temp 0 --repetition-penalty 1.02

Audio: 16 kHz mono. The encoder window is Whisper-style, 30 s maximum, so longer audio must be chunked.
Decoding: use greedy decoding (temp=0) for ASR fidelity. temp>0 removes the rare tail digit loop but degrades content (wrong numerals, out-of-order timestamps).
Digit loop: occasionally the model fails to emit EOS and repeats a digit token at the very end of the output. Quantization weakens EOS emission; this is a known tail artifact that is harmless for transcription use once repeated trailing digits are truncated in post-processing.
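
The chunking mentioned above can be sketched as follows. This is a minimal illustration, not code from this repository: it assumes hard cuts at fixed 30 s boundaries and assumes that timestamps returned for a chunk are relative to that chunk's start. `Chunk`, `plan_chunks`, and `to_global_time` are hypothetical helper names.

```python
from dataclasses import dataclass

SAMPLE_RATE = 16_000   # 16 kHz mono, as stated above
WINDOW_SECONDS = 30    # encoder window limit, as stated above


@dataclass
class Chunk:
    start_sample: int      # inclusive
    end_sample: int        # exclusive
    offset_seconds: float  # add this to timestamps produced for the chunk


def plan_chunks(num_samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Split a clip of num_samples into consecutive windows of at most window_seconds."""
    window = sample_rate * window_seconds
    chunks = []
    for start in range(0, num_samples, window):
        end = min(start + window, num_samples)
        chunks.append(Chunk(start, end, start / sample_rate))
    return chunks


def to_global_time(chunk, local_seconds):
    """Map a timestamp relative to the chunk start onto the full-clip timeline."""
    return chunk.offset_seconds + local_seconds
```

Each chunk's samples (`audio[c.start_sample:c.end_sample]`) would be written out and transcribed separately. Hard cuts can split a word across two chunks; cutting at a nearby silence is a possible refinement that this sketch does not attempt.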

How it was converted

Pure metadata-mapped weight conversion (no retraining):

stage1_mapping.py — verify every MLX target key is sourceable from the PyTorch checkpoint; discover the conv layout transform transpose(0,2,3,1) (PyTorch [out,in,h,w] → MLX [out,h,w,in]).
stage2_convert.py — extract language_model.* + lm_head, quantize to INT4 (group_size 64) via mlx; extract audio encoder/adapter/DeepStack, apply the conv transpose, save BF16. Output mirrors the RumiLabs bridge layout exactly.
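
The conv layout transform from stage 1 can be illustrated with NumPy. This is a sketch of the axis permutation only, not the contents of `stage1_mapping.py` or `stage2_convert.py`; the function name `torch_conv_to_mlx` is hypothetical.

```python
import numpy as np


def torch_conv_to_mlx(weight):
    """Reorder a 2D conv weight from PyTorch [out, in, h, w] to MLX [out, h, w, in]."""
    if weight.ndim != 4:
        raise ValueError(f"expected a 4D conv weight, got {weight.ndim}D")
    # Axis 1 (in-channels) moves to the last position; the other axes keep their order.
    return np.transpose(weight, (0, 2, 3, 1))


# Toy weight: 2 output channels, 3 input channels, 4x5 kernel.
w_torch = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
w_mlx = torch_conv_to_mlx(w_torch)
print(w_mlx.shape)  # (2, 4, 5, 3)
```

After the permutation, `w_mlx[o, h, w, i]` holds the same value as `w_torch[o, i, h, w]`, so no weights change, only their memory layout.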

Limitations

30-second audio window (chunk + offset timestamps for longer input).
Tail digit-loop under greedy (post-truncate).
Homophone errors on domain terms (e.g. 300均 → 三百軍) — fix with a glossary/post-pass.
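
The post-truncation and glossary post-pass listed above could look like the sketch below. It is a heuristic under stated assumptions, not part of this repository: the loop is assumed to be one digit repeated at least `min_repeats` times at the very end of the text, the threshold of 8 is an arbitrary choice meant to spare ordinary numbers such as 1000000, and one copy of the digit is kept. The `GLOSSARY` entry reuses the example above; all function names are hypothetical.

```python
import re

# Hypothetical glossary: misrecognized form -> intended domain term.
GLOSSARY = {"三百軍": "300均"}


def truncate_tail_digit_loop(text, min_repeats=8):
    """Collapse one digit repeated min_repeats or more times at the end of text.

    Repeats may be separated by whitespace. One copy of the digit is kept,
    which is an assumption: the true transcript may or may not end with it.
    """
    pattern = r"([0-9])(?:\s*\1){%d,}\s*$" % (min_repeats - 1)
    match = re.search(pattern, text)
    if match is None:
        return text
    return text[:match.start()] + match.group(1)


def apply_glossary(text, glossary=GLOSSARY):
    """Replace each known misrecognition with its intended term."""
    for wrong, right in glossary.items():
        text = text.replace(wrong, right)
    return text


def postprocess(text):
    """Truncate a trailing digit loop first, then apply the glossary."""
    return apply_glossary(truncate_tail_digit_loop(text))
```

Running `postprocess` on each chunk's transcript before merging keeps a loop at the end of one chunk from leaking into the combined output. A larger glossary may need word-boundary or context checks to avoid replacing legitimate text.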

Credits

Base model: OpenMOSS-Team/MOSS-Audio (Apache-2.0)
MLX bridge (encoder/mel/DeepStack port): RumiLabs Thinking-MLX builds
Instruct→MLX conversion: this work

License

Apache-2.0 (inherited from base model).