---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---
# Phi-4-Audio
**Phi-4-Audio** is a highly efficient adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, exclusively optimized for audio-text interactions (e.g., Automatic Speech Recognition).
By surgically removing the vision processing components—including the image encoder, vision projection layers, and associated processing logic—we have created a streamlined model that delivers lower memory usage while retaining the original model's powerful audio understanding capabilities.
## Usage & Performance
This model is ideal for scenarios where audio processing is the sole modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well-suited for researchers aiming to fine-tune the model specifically for audio tasks without the overhead of unused vision parameters.
### Key Improvements
Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU:
* **Reduced Footprint:** Parameter count reduced by approximately **454 million** (4.74B → 4.29B).
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains full audio-understanding capabilities while running lighter.
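The "~10%" figure follows directly from the peak-memory numbers reported in the benchmark table in this card; a quick sanity check of the arithmetic:

```python
# Peak inference memory from the benchmark table (RTX 5090, bfloat16).
original_peak_gb = 8.88  # Phi-4-multimodal-instruct
stripped_peak_gb = 8.04  # Phi-4-Audio

saved_gb = original_peak_gb - stripped_peak_gb
saved_pct = 100 * saved_gb / original_peak_gb
print(f"saved {saved_gb:.2f} GB ({saved_pct:.1f}%)")  # saved 0.84 GB (9.5%)
```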
## Uses
### Intended Use
* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation.
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.
### Out of Scope
- **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.
## How to Get Started
The model is fully compatible with the Hugging Face `transformers` library. You can use it exactly like the original model, except that image inputs are not supported.
```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


class StrippedVisionModule(nn.Module):
    """Inert placeholder that raises if the removed vision path is ever called."""

    def forward(self, **kwargs):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    """Replace the vision submodules with inert placeholders, in place."""
    passed_model = model
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model
    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()
    try:
        torch.cuda.empty_cache()
    except Exception:
        pass
    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)

# Fetch a sample clip and decode it with soundfile.
audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

# Phi-4 chat format: <|user|>...<|end|><|assistant|>, with an <|audio|> placeholder.
user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

# Pass the clip's actual sample rate as reported by soundfile.
inputs = processor(
    text=prompt, audio=[audio], sampling_rate=samplerate, return_tensors="pt"
).to(device)

# The default generation length is too short for transcription; raise it.
generate_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]
print(f"{response=}")
```
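The same chat template covers the other intended tasks; only the instruction text changes. A small sketch (the instruction strings below are illustrative, not fixed API values):

```python
def build_prompt(speech_prompt: str) -> str:
    """Phi-4 chat template with a single audio placeholder."""
    return f"<|user|><|audio|>{speech_prompt}<|end|><|assistant|>"

# Illustrative instructions for the intended tasks listed above.
asr_prompt = build_prompt("Transcribe the audio clip into text.")
translate_prompt = build_prompt("Translate the audio to English text.")
summary_prompt = build_prompt("Summarize the audio clip in one paragraph.")
```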
## Model Details
- Base Architecture: Phi-4 Multimodal
- Modifications:
- Removed `embed_tokens_extend.image_embed`
- Removed `audio_embed.down_proj_for_vision_speech`
- Removed `audio_embed.up_proj_for_vision_speech`
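After loading, you can confirm the vision path is gone by checking that each removed submodule is either absent or an inert placeholder. A minimal sketch of that check; the `SimpleNamespace` objects here are toy stand-ins for the real module tree (on an actual model you would pass `model.model.embed_tokens_extend`):

```python
from types import SimpleNamespace


class StrippedVisionModule:
    """Stand-in for the inert placeholder from the usage example."""

    def __call__(self, **kwargs):
        raise ValueError("Vision is not supported")


def vision_is_stripped(emb_ext) -> bool:
    """True if every removed vision submodule is absent or a placeholder."""
    targets = [
        (emb_ext, "image_embed"),
        (emb_ext.audio_embed, "down_proj_for_vision_speech"),
        (emb_ext.audio_embed, "up_proj_for_vision_speech"),
    ]
    return all(
        not hasattr(owner, name)
        or isinstance(getattr(owner, name), StrippedVisionModule)
        for owner, name in targets
    )


# Toy stand-in for `embed_tokens_extend` after stripping.
stripped = SimpleNamespace(
    image_embed=StrippedVisionModule(),
    audio_embed=SimpleNamespace(
        down_proj_for_vision_speech=StrippedVisionModule(),
        up_proj_for_vision_speech=StrippedVisionModule(),
    ),
)
print(vision_is_stripped(stripped))  # True
```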
## Comparisons
### Parameter Count
| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |
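The reduction column can be verified directly from the two totals:

```python
# Parameter counts from the table above.
original_params = 4_743_988_032  # Phi-4-multimodal-instruct
stripped_params = 4_289_848_960  # Phi-4-Audio

removed = original_params - stripped_params
print(f"{removed:,} parameters removed (~{removed / 1e6:.0f}M)")
# 454,139,072 parameters removed (~454M)
```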
### Benchmark Results
Tested on a single NVIDIA RTX 5090 with `torch.bfloat16`.
| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory (GB) | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |
## Citation
If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.
```bibtex
@article{abouelenin2025phi,
title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
journal={arXiv preprint arXiv:2503.01743},
year={2025}
}
```