Commit c655700 (verified) by shreyask · 1 parent: d57ad8a

Add model card

Files changed (1): README.md +68 -0
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ library_name: mlx
+ tags:
+ - mlx
+ - mlx-audio
+ - qwen2-audio
+ - audio
+ - speech
+ - multimodal
+ - 4bit
+ base_model: Qwen/Qwen2-Audio-7B-Instruct
+ license: apache-2.0
+ pipeline_tag: audio-text-to-text
+ ---
+
+ # Qwen2-Audio-7B-Instruct (4-bit MLX)
+
+ 4-bit quantized version of [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) for Apple Silicon via [mlx-audio](https://github.com/Blaizzy/mlx-audio).
+
+ ## Usage
+
+ ```python
+ from mlx_audio.stt.utils import load_model
+
+ model = load_model("mlx-community/Qwen2-Audio-7B-Instruct-4bit")
+
+ # Transcription
+ result = model.generate("audio.wav", prompt="Transcribe the audio.")
+ print(result.text)
+
+ # Audio understanding
+ result = model.generate("audio.wav", prompt="What emotion is the speaker expressing?")
+ print(result.text)
+
+ # Translation
+ result = model.generate("audio.wav", prompt="Translate the speech to French.")
+ print(result.text)
+ ```
+
+ ## Model Details
+
+ - **Base model**: Qwen/Qwen2-Audio-7B-Instruct
+ - **Quantization**: 4-bit (group_size=64), LLM only (encoder and projector kept in bf16)
+ - **Size**: ~4.2GB (vs ~15GB bf16)
+ - **Architecture**: Whisper-style encoder (32 layers) + Linear projector + Qwen2-7B LLM
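
As a rough illustration of what `group_size=64` implies for storage, the sketch below estimates the effective bits per weight of group-wise quantization. It assumes that each group of weights stores one 16-bit scale and one 16-bit bias next to the packed values; that layout and the helper names `effective_bits_per_weight` and `estimated_size_gb` are assumptions made for illustration and are not stated in this model card.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Average storage cost per weight, including per-group metadata.

    Assumption: every group of `group_size` weights carries one scale and
    one bias, each stored with the given bit widths.
    """
    if bits <= 0 or group_size <= 0:
        raise ValueError("bits and group_size must be positive")
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for `num_params` weights at the given cost, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# 4-bit weights in groups of 64, as listed under Model Details.
print(effective_bits_per_weight(4, 64))  # 4.5
```

Under these assumptions the quantized layers cost 4.5 bits per weight rather than exactly 4, while layers kept in bf16 cost 16 bits per weight. `estimated_size_gb` can be applied to each component separately once its parameter count is known.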
+
+ ## Capabilities
+
+ - Speech transcription (ASR)
+ - Speech translation
+ - Audio captioning
+ - Emotion / sentiment detection
+ - Environmental sound classification
+ - Music understanding
+ - Voice chat (audio-only input)
+
+ ## Performance
+
+ Tested on Apple Silicon (M-series):
+ - ~4.7 tokens/sec generation (4-bit)
+ - Accurate transcription matching HuggingFace reference
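
The card does not say how the tokens/sec figure was measured. One possible way to obtain a comparable number is sketched below: time a single generation call and divide a token count by the wall-clock time. The helper `measure_tokens_per_second`, the placeholder generator, and the whitespace-based token counter are all hypothetical stand-ins so the sketch runs without a model or audio file.

```python
import time
from typing import Callable, Tuple


def measure_tokens_per_second(generate: Callable[[], str],
                              count_tokens: Callable[[str], int]) -> Tuple[str, float]:
    """Run generate() once and return (text, tokens per second of wall time)."""
    start = time.perf_counter()
    text = generate()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time was not positive")
    return text, count_tokens(text) / elapsed


# Placeholder generator standing in for a real model call.
def fake_generate() -> str:
    time.sleep(0.01)
    return "hello from a placeholder model"


# Whitespace splitting is only a crude proxy for a real tokenizer.
text, tps = measure_tokens_per_second(fake_generate, lambda s: len(s.split()))
print(f"{tps:.1f} tokens/sec")
```

With the real model, `generate` could wrap the call from the Usage section, for example `lambda: model.generate("audio.wav", prompt="Transcribe the audio.").text`. Wall-clock time measured this way also includes audio encoding and prompt processing, so it will tend to read lower than a pure decoding rate.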
+
+ ## Conversion
+
+ Converted using mlx-audio with:
+ - Audio encoder: bf16 (not quantized)
+ - Multi-modal projector: bf16 (not quantized)
+ - Language model: 4-bit quantized (group_size=64)
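
The split listed above (encoder and projector in bf16, language model in 4-bit) can be pictured as a per-module decision based on where a layer sits in the model tree. The sketch below shows such a decision function in plain Python. The prefixes `audio_tower`, `multi_modal_projector`, and `language_model` are hypothetical module names chosen for illustration; the names mlx-audio actually uses, and the way its converter selects layers, are not given in this commit.

```python
# Hypothetical top-level module names; the real ones in mlx-audio may differ.
KEEP_FULL_PRECISION = ("audio_tower", "multi_modal_projector")
QUANTIZE = ("language_model",)


def should_quantize(path: str) -> bool:
    """Return True only for modules that live under the language model."""
    top = path.split(".", 1)[0]
    if top in KEEP_FULL_PRECISION:
        return False
    return top in QUANTIZE


for p in [
    "audio_tower.layers.0.fc1",
    "multi_modal_projector.linear",
    "language_model.model.layers.0.mlp.up_proj",
]:
    print(p, should_quantize(p))
```

In MLX, `mlx.nn.quantize` accepts a `class_predicate` callable that receives a module path and the module, so a function like this could plausibly be wrapped as `lambda path, module: should_quantize(path) and hasattr(module, "to_quantized")` and passed together with `group_size=64, bits=4`. Whether the conversion for this checkpoint was done that way is an assumption.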