---
library_name: transformers
license: mit
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- no
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- phi-4-multimodal
- phi
- phi-4-mini
base_model: microsoft/Phi-4-multimodal-instruct
---

# Phi-4-Audio

**Phi-4-Audio** is a highly efficient adaptation of the [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model, exclusively optimized for audio-text interactions (e.g., Automatic Speech Recognition).

By surgically removing the vision processing components—including the image encoder, vision projection layers, and associated processing logic—we have created a streamlined model that delivers lower memory usage while retaining the original model's powerful audio understanding capabilities.

## Usage & Performance

This model is ideal for scenarios where audio processing is the sole modality, such as transcription services, voice assistants, and audio-based QA systems. It is also well-suited for researchers aiming to fine-tune the model specifically for audio tasks without the overhead of unused vision parameters.

### Key Improvements

Comparing **Phi-4-Audio** against the original **Phi-4-multimodal-instruct** on a single NVIDIA RTX 5090 GPU:

* **Reduced Footprint:** Parameter count reduced by approximately **450 Million**.
* **Lower VRAM Usage:** Peak inference memory usage reduced by **~10% (0.84 GB saved)**.
* **Same Audio Performance:** Retains full audio-understanding capabilities while running lighter.

## Uses

### Intended Use

* **Automatic Speech Recognition (ASR):** High-fidelity transcription of spoken audio.
* **Speech Translation:** Direct speech-to-text translation.
* **Audio Summarization:** Generating summaries from audio recordings.
* **Spoken Instruction Tuning:** Fine-tuning on pure audio-text pairs.

### Out of Scope

* **Image/Vision Tasks:** This model **cannot** process images. Attempts to pass image inputs will fail or raise errors, as the vision encoders have been stripped.

## How to Get Started

The model is fully compatible with the Hugging Face `transformers` library. You can use it just like the original model, except that image inputs are not supported.

```python
import torch
from torch import nn
from io import BytesIO
from urllib.request import urlopen
from soundfile import read
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
)


# Parameter-free stub that raises if the removed vision path is ever called.
class StrippedVisionModule(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(
        self,
        **kwargs,
    ):
        raise ValueError("Vision is not supported")


def strip_vision_inplace(
    model: Phi4MultimodalForCausalLM | Phi4MultimodalModel,
) -> Phi4MultimodalForCausalLM | Phi4MultimodalModel:
    passed_model = model

    # Work on the inner Phi4MultimodalModel, which owns embed_tokens_extend.
    if isinstance(model, Phi4MultimodalForCausalLM):
        model = model.model

    # Swap the image embedder and the vision-speech projections for stubs.
    emb_ext = model.embed_tokens_extend
    if hasattr(emb_ext, "image_embed"):
        emb_ext.image_embed = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "down_proj_for_vision_speech"):
        emb_ext.audio_embed.down_proj_for_vision_speech = StrippedVisionModule()
    if hasattr(emb_ext.audio_embed, "up_proj_for_vision_speech"):
        emb_ext.audio_embed.up_proj_for_vision_speech = StrippedVisionModule()

    # Release any GPU memory freed by dropping the vision modules.
    try:
        torch.cuda.empty_cache()
    except Exception:
        pass

    return passed_model


model_path = "JacobLinCool/phi-4-audio"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_path)
strip_vision_inplace(model)


audio_url = "https://huggingface.co/datasets/JacobLinCool/audio-testing/resolve/main/audio/audio-1.mp3"
audio, samplerate = read(BytesIO(urlopen(audio_url).read()))

user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"
speech_prompt = "Transcribe the audio clip into text."
prompt = f"{user_prompt}<|audio|>{speech_prompt}{prompt_suffix}{assistant_prompt}"

# Phi-4's audio encoder expects 16 kHz input; resample beforehand if your clip differs.
inputs = processor(
    text=prompt, audio=[audio], sampling_rate=16000, return_tensors="pt"
).to(device)

# Cap the number of generated tokens; increase for longer audio.
generate_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True
)[0]

print(f"{response=}")
```
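
The same chat format covers the other intended tasks; only the instruction in `speech_prompt` changes. The prompt strings below are illustrative examples (reusing `user_prompt`, `prompt_suffix`, and `assistant_prompt` from the example above), not wording the model requires:

```python
# Illustrative prompts for speech translation and audio summarization;
# the exact instruction wording is an example only.
translation_prompt = "Translate the audio clip into French."
summary_prompt = "Summarize the audio clip in two sentences."

prompt = f"{user_prompt}<|audio|>{translation_prompt}{prompt_suffix}{assistant_prompt}"
```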

## Model Details

- Base Architecture: Phi-4 Multimodal
- Modifications:
  - Removed `embed_tokens_extend.image_embed`
  - Removed `audio_embed.down_proj_for_vision_speech`
  - Removed `audio_embed.up_proj_for_vision_speech`
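
Because `strip_vision_inplace` replaces these modules with parameter-free stubs rather than deleting the attributes, a quick sanity check is to confirm they no longer carry weights. A minimal sketch, assuming `model` was loaded and stripped as in the usage example above:

```python
# Sketch: confirm the former vision modules carry zero parameters after stripping.
emb_ext = model.model.embed_tokens_extend
checks = {
    "image_embed": getattr(emb_ext, "image_embed", None),
    "audio_embed.down_proj_for_vision_speech": getattr(
        emb_ext.audio_embed, "down_proj_for_vision_speech", None
    ),
    "audio_embed.up_proj_for_vision_speech": getattr(
        emb_ext.audio_embed, "up_proj_for_vision_speech", None
    ),
}
for name, module in checks.items():
    n_params = sum(p.numel() for p in module.parameters()) if module is not None else 0
    kind = type(module).__name__ if module is not None else "absent"
    print(f"{name}: {kind}, {n_params} params")
```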

## Comparisons

### Parameter Count

| Model | Total Parameters | Reduction |
| --- | --- | --- |
| Phi-4-multimodal-instruct | 4,743,988,032 (4.74B) | - |
| Phi-4-Audio | 4,289,848,960 (4.29B) | -454M |
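
The totals above can be reproduced by summing `numel()` over the loaded model's parameters. A minimal sketch, assuming `model` is the loaded checkpoint; after `strip_vision_inplace`, the figure should correspond to the Phi-4-Audio row:

```python
# Sketch: total parameter count of the loaded (and stripped) model.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,} (~{total / 1e9:.2f}B)")
```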

### Benchmark Results

Tested on NVIDIA RTX 5090, `torch.bfloat16`.

| Metric | Original Model | Phi-4-Audio | Delta |
| --- | --- | --- | --- |
| Peak Memory (GB) | 8.88 GB | 8.04 GB | -0.84 GB |
| Inference Speed (Warm) | ~100.5 tokens/s | ~100.5 tokens/s | Similar |
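
Peak memory and warm-run throughput can be measured with standard PyTorch counters. A minimal sketch of that kind of measurement, reusing `model`, `device`, and `inputs` from the usage example above (the exact prompt, audio length, and warm-up used for the table may differ):

```python
import time

# Sketch: peak VRAM and generation speed for a single warm run.
torch.cuda.reset_peak_memory_stats(device)
start = time.perf_counter()
generate_ids = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = generate_ids.shape[1] - inputs["input_ids"].shape[1]
peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
print(f"Peak memory: {peak_gb:.2f} GB | Speed: {new_tokens / elapsed:.1f} tokens/s")
```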

## Citation

If you use this model version, please cite the original Phi-4 Multimodal paper and acknowledge the modifications.

```bibtex
@article{abouelenin2025phi,
  title={Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras},
  author={Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others},
  journal={arXiv preprint arXiv:2503.01743},
  year={2025}
}
```