	---
	license: apache-2.0
	base_model: OpenMOSS-Team/MOSS-Audio-8B-Instruct
	tags:
	- mlx
	- audio
	- moss-audio
	- asr
	- int4
	- apple-silicon
	language:
	- en
	- zh
	pipeline_tag: audio-text-to-text
	library_name: mlx
	---

	# MOSS-Audio-8B-Instruct-MLX (hybrid: INT4 LLM + BF16 audio)

	An Apple MLX conversion of MOSS-Audio-8B-Instruct, the MOSS-Audio checkpoint with the
	strongest ASR, for fast, low-memory inference on Apple Silicon. The LLM is quantized to
	uniform INT4 (group_size 64); the audio encoder, adapter, and DeepStack are kept in BF16.

	> Why this exists. The community had MLX builds only of the Thinking variant, which is
	> not ASR-optimized: under identical INT4 quantization it misspells tickers that are
	> spoken letter by letter (e.g. "CRWD" → "CWD") and its output is unstable. Instruct
	> transcribes them correctly. This build brings Instruct's transcription quality to MLX
	> speed and memory use.

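	Summary: this is an Apple MLX conversion of MOSS-Audio-8B-Instruct (uniform INT4 LLM,
	BF16 audio path). The community previously had MLX builds only of the Thinking variant,
	but Thinking is not ASR-optimized: under the same INT4 quantization it transcribes a
	ticker spoken letter by letter (such as "CRWD") as "CWD" and is unstable. Instruct
	transcribes it correctly. This build brings Instruct's transcription quality to MLX
	speed and memory use.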

	## Measured (Apple M1 Max 32GB, 28s zh+en clip)

	| Metric | PyTorch Instruct | This (Instruct-MLX) | Thinking-MLX |
	|---|:---:|:---:|:---:|
	| Ticker "CRWD" | C R W D ✅ | C R W D ✅ | CWD ❌ |
	| English term (TradingView) | ✅ | ✅ | ✅ (loops) |
	| Numerals | Chinese chars | Arabic 47% | Arabic |
	| Speed | 1.8x realtime | 6–9x realtime | 5–8x realtime |
	| Peak memory | ~17 GB | 7.85 GB | 7.85 GB |
	| Disk | 18 GB | 5.9 GB | 5.9 GB |

	Key finding. Ticker-ASR degradation in the Thinking-MLX builds comes from the
	Thinking/Instruct training difference, not from INT4 quantization: under the same
	uniform INT4, Instruct still transcribes the ticker correctly. Uniform 4-bit
	quantization of the LLM therefore suffices, and no mixed-precision scheme is needed.

	## Usage

	```bash
	pip install mlx mlx-lm soundfile numpy
	python inference.py --audio your_clip_16k_mono.wav
	```

	Transcription with per-segment timestamps (a Traditional-Chinese prompt triggers
	zh-Hant output):

	```bash
	python run_moss.py --model . --audio clip.wav \
	--prompt "請逐句轉錄這段音訊，每句標註開始時間。" --temp 0 --repetition-penalty 1.02
	```

	- Audio: 16 kHz mono. The encoder window is Whisper-style with a 30 s maximum, so chunk longer audio.
	- Decoding: use greedy decoding (`temp=0`) for ASR fidelity. `temp>0` removes the rare
	tail digit-loop but degrades content (wrong numerals, out-of-order timestamps).
	- Digit-loop: occasionally the model fails to emit EOS and repeats a digit token
	at the very end of the output; truncate the repeated trailing digits in post-processing.
	Quantization weakens EOS emission; the loop is a known tail artifact and harmless for transcription use.
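The 30 s window noted above means longer recordings have to be split before inference. The following is a minimal sketch, assuming fixed-length, non-overlapping chunks; `chunk_audio` and its parameters are hypothetical names for illustration and are not part of this repo's scripts. Each chunk is returned with its start offset so that timestamps produced per chunk can be shifted back onto the original timeline.

```python
SAMPLE_RATE = 16_000   # the card specifies 16 kHz mono input
MAX_WINDOW_S = 30.0    # encoder window stated in the notes above


def chunk_audio(samples, sample_rate=SAMPLE_RATE, window_s=MAX_WINDOW_S):
    """Yield (offset_seconds, chunk) pairs; each chunk is at most window_s long.

    `samples` can be any sliceable sequence of mono samples, for example a
    1-D NumPy array as returned by soundfile.read().
    """
    step = int(window_s * sample_rate)
    for start in range(0, len(samples), step):
        yield start / sample_rate, samples[start:start + step]


# Example: 70 s of silence splits into 30 s + 30 s + 10 s.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = list(chunk_audio(audio))
offsets = [offset for offset, _ in chunks]
durations = [len(chunk) / SAMPLE_RATE for _, chunk in chunks]
print(offsets)    # [0.0, 30.0, 60.0]
print(durations)  # [30.0, 30.0, 10.0]
```

Hard cuts at fixed positions can split a word across two chunks; cutting at silences or using a small overlap would likely reduce boundary errors, but that is a design choice this sketch does not make.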

	## How it was converted

	Pure metadata-mapped weight conversion (no retraining):

	1. `stage1_mapping.py` — verify every MLX target key is sourceable from the PyTorch
	checkpoint; discover the conv layout transform `transpose(0,2,3,1)`
	(PyTorch `[out,in,h,w]` → MLX `[out,h,w,in]`).
	2. `stage2_convert.py` — extract `language_model.*` + `lm_head`, quantize to INT4
	(group_size 64) via mlx; extract audio encoder/adapter/DeepStack, apply the conv
	transpose, save BF16. Output mirrors the RumiLabs bridge layout exactly.
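The two transforms named in the steps above can be illustrated in NumPy. This is a sketch under stated assumptions, not the code in `stage1_mapping.py` or `stage2_convert.py`: the function names are hypothetical, and the quantizer is a generic affine group-wise scheme that only shows what "INT4, group_size 64" means. MLX's own quantization routine packs the 4-bit codes and may differ in detail.

```python
import numpy as np


def conv_pt_to_mlx(w):
    """Reorder a conv weight from PyTorch [out, in, h, w] to MLX [out, h, w, in]."""
    return np.transpose(w, (0, 2, 3, 1))


def quantize_groups(w, bits=4, group_size=64):
    """Affine group-wise quantization along the last axis (illustrative only).

    Every run of `group_size` weights shares one scale and one minimum.
    """
    levels = 2 ** bits - 1                        # 15 for INT4
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels    # guard against constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


def dequantize_groups(q, scale, lo, shape):
    """Map the integer codes back to floats and restore the original shape."""
    return (q * scale + lo).reshape(shape)


conv = np.zeros((8, 3, 5, 7), dtype=np.float32)   # [out, in, h, w]
print(conv_pt_to_mlx(conv).shape)                 # (8, 5, 7, 3)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 128)).astype(np.float32)
q, scale, lo = quantize_groups(w)
w_hat = dequantize_groups(q, scale, lo, w.shape)
max_err = float(np.abs(w - w_hat).max())          # bounded by half a quantization step
```

The round-trip error per weight is at most half of that group's step size, which is why smaller groups (more scales) trade extra storage for lower error.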

	## Limitations

	- 30-second audio window (chunk + offset timestamps for longer input).
	- Tail digit-loop under greedy (post-truncate).
	- Homophone errors on domain terms (e.g. 300均 → 三百軍) — fix with a glossary/post-pass.
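The post-truncation and glossary pass mentioned in the list above could look like the following. This is a hedged sketch: `truncate_digit_loop`, `apply_glossary`, the `min_repeats` threshold, and the choice to keep one copy of the repeated digit are all assumptions made for illustration, and the glossary holds only the one example given in this card.

```python
import re


def truncate_digit_loop(text, min_repeats=8):
    """Collapse a run of one digit repeated at the very end of `text`.

    Assumption: a digit repeated `min_repeats` or more times at the tail is
    the loop artifact, and a single copy of it belongs to the real output.
    """
    pattern = re.compile(r"(\d)(?:\s*\1){%d,}\s*$" % (min_repeats - 1))
    return pattern.sub(r"\1", text)


# Mis-transcription -> intended term (example taken from the list above).
GLOSSARY = {"三百軍": "300均"}


def apply_glossary(text, glossary=GLOSSARY):
    """Replace known homophone errors with the intended domain terms."""
    for wrong, right in glossary.items():
        text = text.replace(wrong, right)
    return text


raw = "今天站上三百軍，收盤 123" + "3" * 20
clean = apply_glossary(truncate_digit_loop(raw))
print(clean)  # 今天站上300均，收盤 123
```

The threshold is a heuristic: a transcript that legitimately ends in a long run of one digit (for example a number with many trailing zeros) would also be shortened, so `min_repeats` should be tuned to the material.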

	## Credits

	- Base model: [OpenMOSS-Team/MOSS-Audio](https://github.com/OpenMOSS/MOSS-Audio) (Apache-2.0)
	- MLX bridge (encoder/mel/DeepStack port): [RumiLabs](https://huggingface.co/RumiLabs) Thinking-MLX builds
	- Instruct→MLX conversion: this work

	## License

	Apache-2.0 (inherited from base model).