---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---

# Phi-4-Audio

**Phi-4-Audio** is a highly efficient adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, exclusively optimized for audio-text interactions (e.g., Automatic Speech Recognition).

By surgically removing the vision processing components—including the image encoder, vision projection layers, and associated processing logic—we have created a streamlined model that delivers lower memory usage while retaining the original model's powerful audio understanding capabilities.

## Usage & Performance

This model is ideal for scenarios where audio processing is the sole modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well-suited for researchers aiming to fine-tune the model specifically for audio tasks without the overhead of unused vision parameters.

### Key Improvements

Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU:

* **Reduced Footprint:** Parameter count reduced by approximately **450 Million**.
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains full audio-understanding capabilities while running lighter.

## Uses

### Intended Use

* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation.
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.

### Out of Scope

* **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.

## How to Get Started

The model is fully compatible with the Hugging Face `transformers` library. You can use it just like the original model, except that image inputs are not supported.

```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


# Parameter-free stub that raises if the removed vision path is ever called.
class StrippedVisionModule(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(
        self,
        **kwargs,
    ):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    passed_model = model

    # Work on the inner Phi4MultimodalModel, which owns embed_tokens_extend.
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model

    # Swap the image embedder and the vision-speech projections for stubs.
    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()

    # Release any GPU memory freed by dropping the vision modules.
    try:
        torch.cuda.empty_cache()
    except Exception:
        pass

    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)


audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

# Phi-4's audio encoder expects 16 kHz input; resample beforehand if your clip differs.
inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

# Cap the number of generated tokens; increase for longer audio.
generate_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]

print(f"{response=}")
```
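
The same chat format covers the other intended tasks; only the instruction in `speech_prompt` changes. The prompt strings below are illustrative examples (reusing `user_prompt`, `prompt_suffix`, and `assistant_prompt` from the example above), not wording the model requires:

```python
# Illustrative prompts for speech translation and audio summarization;
# the exact instruction wording is an example only.
translation_prompt = "Translate the audio clip into French."
summary_prompt = "Summarize the audio clip in two sentences."

prompt = f"{user_prompt}<|audio|>{translation_prompt}{prompt_suffix}{assistant_prompt}"
```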

## Model Details

- Base Architecture: Phi-4 Multimodal
- Modifications:
  - Removed `embed_tokens_extend.image_embed`
  - Removed `audio_embed.down_proj_for_vision_speech`
  - Removed `audio_embed.up_proj_for_vision_speech`
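
Because `strip_vision_inplace` replaces these modules with parameter-free stubs rather than deleting the attributes, a quick sanity check is to confirm they no longer carry weights. A minimal sketch, assuming `model` was loaded and stripped as in the usage example above:

```python
# Sketch: confirm the former vision modules carry zero parameters after stripping.
emb_ext = model.model.embed_tokens_extend
checks = {
    "image_embed": getattr(emb_ext, "image_embed", None),
    "audio_embed.down_proj_for_vision_speech": getattr(
        emb_ext.audio_embed, "down_proj_for_vision_speech", None
    ),
    "audio_embed.up_proj_for_vision_speech": getattr(
        emb_ext.audio_embed, "up_proj_for_vision_speech", None
    ),
}
for name, module in checks.items():
    n_params = sum(p.numel() for p in module.parameters()) if module is not None else 0
    kind = type(module).__name__ if module is not None else "absent"
    print(f"{name}: {kind}, {n_params} params")
```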

## Comparisons

### Parameter Count

| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |
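
The totals above can be reproduced by summing `numel()` over the loaded model's parameters. A minimal sketch, assuming `model` is the loaded checkpoint; after `strip_vision_inplace`, the figure should correspond to the Phi-4-Audio row:

```python
# Sketch: total parameter count of the loaded (and stripped) model.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,} (~{total / 1e9:.2f}B)")
```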

### Benchmark Results

Tested on NVIDIA RTX 5090, `torch.bfloat16`.

| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory (GB) | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |
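
Peak memory and warm-run throughput can be measured with standard PyTorch counters. A minimal sketch of that kind of measurement, reusing `model`, `device`, and `inputs` from the usage example above (the exact prompt, audio length, and warm-up used for the table may differ):

```python
import time

# Sketch: peak VRAM and generation speed for a single warm run.
torch.cuda.reset_peak_memory_stats(device)
start = time.perf_counter()
generate_ids = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = generate_ids.shape[1] - inputs["input_ids"].shape[1]
peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
print(f"Peak memory: {peak_gb:.2f} GB | Speed: {new_tokens / elapsed:.1f} tokens/s")
```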

## Citation

If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.

```bibtex
@article{abouelenin2025phi,
  title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
  author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
  journal={arXiv preprint arXiv:2503.01743},
  year={2025}
}
```