---
license: apache-2.0
language:
- ru
- en
pipeline_tag: audio-text-to-text
tags:
- audio
- speech
- multimodal
- whisper
- qwen
library_name: transformers
datasets:
- Vikhrmodels/AudioBooksInstructGemini2.5
---
# Borealis-5B-IT
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/Vikhrmodels/Borealis-5b-it/blob/main/Borealis_Demo.ipynb)
Borealis is an audio-language model that combines a Whisper encoder with a Qwen3-4B LLM for speech understanding and instruction-following tasks.
## Model Description
- **Audio Encoder**: Whisper Large V3 (frozen)
- **Language Model**: Qwen3-4B (fine-tuned)
- **Adapter**: 2-layer MLP projecting audio embeddings to LLM space
- **Total Parameters**: ~5B (≈0.6B Whisper encoder + ≈4B Qwen3-4B)
- **Languages**: Russian, English
## Installation
```bash
pip install transformers torch torchaudio safetensors
```
## Quick Start
```python
import torch
import torchaudio
from transformers import AutoModel

# Load model (custom code ships with the checkpoint)
model = AutoModel.from_pretrained(
    "Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    device="cuda",
)
model.eval()

# Load audio and resample to 16 kHz
audio, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.squeeze()  # drop the channel dimension (expects mono input)

# Generate response
with torch.inference_mode():
    output_ids = model.generate(
        audio=audio,
        user_prompt="What is being said in this audio? <|start_of_audio|><|end_of_audio|>",
        system_prompt="You are a helpful voice assistant.",
        max_new_tokens=256,
        temperature=0.7,
    )
response = model.decode(output_ids[0])
print(response)
```
## Prompt Examples
### Audio Transcription
```python
output = model.generate(
    audio=audio,
    user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a speech recognition assistant. Accurately transcribe audio to text.",
)
```
### Audio Summarization
```python
output = model.generate(
    audio=audio,
    user_prompt="Summarize what is said in this recording: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a helpful voice assistant.",
)
```
### Audio Q&A (Russian)
```python
output = model.generate(
    audio=audio,
    # "What is being said in this audio recording?"
    user_prompt="О чём говорится в этой аудиозаписи? <|start_of_audio|><|end_of_audio|>",
    # "You are a helpful voice assistant."
    system_prompt="Ты полезный голосовой ассистент.",
)
```
### Content Description
```python
output = model.generate(
    audio=audio,
    user_prompt="Describe in detail what you hear: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an attentive listener.",
)
```
### Emotion Analysis
```python
output = model.generate(
    audio=audio,
    user_prompt="What emotions does the speaker express? <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an expert in audio analysis.",
)
```
## Training Data
The model was fine-tuned on a diverse mix of audio-instruction datasets:
| Dataset | Description | Size |
|---------|-------------|------|
| [Vikhrmodels/Speech-Instructions](https://huggingface.co/datasets/Vikhrmodels/Speech-Instructions) | General speech instruction-following | 70k |
| [Vikhrmodels/Speech-Describe](https://huggingface.co/datasets/Vikhrmodels/Speech-Describe) | Audio description tasks (speech & non-speech) | ~2M |
| [Vikhrmodels/ToneBooks](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) | Russian audiobook excerpts | - |
| [Vikhrmodels/AudioBooksInstructGemini2.5](https://huggingface.co/datasets/Vikhrmodels/AudioBooksInstructGemini2.5) | Instruction data generated with Gemini 2.5 | - |
## Model Architecture
```
Audio Input (16kHz)
         │
         ▼
┌──────────────────┐
│ Whisper Large V3 │  (Frozen)
│     Encoder      │
└────────┬─────────┘
         │  1280-dim embeddings
         ▼
┌──────────────────┐
│   Downsampler    │  (4x temporal reduction)
│    + Adapter     │
└────────┬─────────┘
         │  2560-dim embeddings
         ▼
┌──────────────────┐
│     Qwen3-4B     │  (Fine-tuned)
│       LLM        │
└────────┬─────────┘
         │
         ▼
    Text Output
```
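The downsample-and-project stage is small enough to sketch directly. The following is a minimal PyTorch sketch, not the shipped implementation: the class name matches the one used in the vLLM section below, but the hidden width and the SiLU activation are assumptions.

```python
import torch
import torch.nn as nn

class AudioLanguageAdapter(nn.Module):
    """Sketch of the 2-layer MLP adapter: 4x-downsampled Whisper
    features (4 * 1280 = 5120 dims) -> Qwen3 hidden size (2560).
    Hidden width and activation are assumptions, not the shipped config."""

    def __init__(self, enc_dim=1280, stack=4, llm_dim=2560):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):  # x: [batch, 1500, 1280] Whisper encoder output
        b, t, d = x.shape
        t = t - t % self.stack  # drop frames that don't fill a group of 4
        # 4x temporal reduction by concatenating adjacent frames:
        # [b, 1500, 1280] -> [b, 375, 5120]
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)  # [b, 375, 2560]

# Shape check: a 30 s clip yields 1500 encoder frames -> 375 audio tokens.
adapter = AudioLanguageAdapter()
print(adapter(torch.randn(1, 1500, 1280)).shape)  # torch.Size([1, 375, 2560])
```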
## vLLM Support
Borealis has native vLLM support through a plugin system. This enables high-performance inference with full audio processing.
### Install vLLM Plugin
```bash
pip install "vllm>=0.12.0"
# Clone plugin only (skip large model weights)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Vikhrmodels/Borealis-5b-it
cd Borealis-5b-it/vllm_borealis
pip install -e .
```
### Basic Usage
```python
import librosa
from vllm import LLM, SamplingParams

# Load model with vLLM
llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

# Load audio (16 kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Simple prompt with audio placeholder
prompt = "<|AUDIO|>Transcribe this audio."
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
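`llm.generate` also accepts a list of request dicts, so several clips can be batched in one call and scheduled together. A short sketch reusing `llm` and `sampling_params` from above; the file names are placeholders:

```python
# Batch several clips in one call; vLLM schedules them together.
requests = [
    {
        "prompt": "<|AUDIO|>Transcribe this audio.",
        "multi_modal_data": {"audio": librosa.load(path, sr=16000)[0]},
    }
    for path in ["clip1.wav", "clip2.wav"]  # placeholder file names
]
outputs = llm.generate(requests, sampling_params=sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```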
### With Chat Format
```python
import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

audio, sr = librosa.load("audio.wav", sr=16000)

# Build prompt with the Qwen3 chat format
prompt = """<|im_start|>system
You are a helpful voice assistant.<|im_end|>
<|im_start|>user
<|AUDIO|>What is being said in this audio?<|im_end|>
<|im_start|>assistant
"""

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
### OpenAI-Compatible Server
> **Note**: Install the vLLM plugin first (see above).
```bash
# Start vLLM server
vllm serve Vikhrmodels/Borealis-5b-it \
--trust-remote-code \
--dtype bfloat16 \
--limit-mm-per-prompt audio=1
```
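A client can then send audio through the standard chat completions API. The sketch below assumes the plugin accepts vLLM's usual `audio_url` multimodal content type with a base64 data URL and that a chat template is registered; it uses the `openai` Python package:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local WAV file as a data URL (vLLM's audio_url content type).
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Vikhrmodels/Borealis-5b-it",
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},
        {
            "role": "user",
            "content": [
                {"type": "audio_url",
                 "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
                {"type": "text", "text": "What is being said in this audio?"},
            ],
        },
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```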
### How It Works
The vLLM plugin processes audio through the full Borealis pipeline:
```
Audio (numpy array, 16kHz)
   ↓ WhisperFeatureExtractor
Mel spectrogram [128, 3000]
   ↓ WhisperEncoder (frozen)
Encoder output [1500, 1280]
   ↓ Downsample 4x (concat adjacent frames)
Stacked features [375, 5120]
   ↓ AudioLanguageAdapter (2-layer MLP)
Audio embeddings [375, 2560]
   ↓ Replace <|AUDIO|> tokens
   ↓ Qwen3-4B LLM (vLLM optimized)
Generated text
```
Each 30-second audio clip produces **375 audio tokens** in the sequence.
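The 375-token count follows directly from the shapes above: 30 s of audio at a 10 ms mel hop gives 3000 frames, the Whisper encoder halves the time axis, and the downsampler groups frames by 4. A quick arithmetic check:

```python
# 30 s of 16 kHz audio -> 3000 mel frames (10 ms hop)
# Whisper encoder halves the time axis -> 1500 frames (20 ms each)
# 4x downsampling by frame concatenation -> 1500 / 4 = 375 audio tokens
mel_frames = 30 * 1000 // 10        # 3000
encoder_frames = mel_frames // 2    # 1500
audio_tokens = encoder_frames // 4  # 375
print(audio_tokens)                 # 375
```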
### Benchmark Results
Tested on an NVIDIA A100 with 30 s audio input and 128 max new tokens:
| Method | Throughput | Speedup |
|--------|------------|---------|
| HuggingFace (native) | 44.9 tok/s | 1.0x |
| **vLLM (plugin)** | **95.9 tok/s** | **2.1x** |
The vLLM plugin delivers roughly a 2x throughput gain over native HuggingFace inference while retaining the full audio-processing pipeline.
### ASR Benchmarks (WER / CER)
| Split | Borealis baseline | Borealis step-2898 | Whisper-v3 |
|---------------------|-------------------|--------------------|------------|
| Russian_LibriSpeech | 6.63% | 5.64% | 11.68% |
| Common_Voice | 8.88% | 12.67% | 12.23% |
| Tone_Webinars | 56.87% | 60.55% | 7.77% |
| Tone_Books | 6.03% | 5.25% | 11.95% |
| Tone_Speak | 4.63% | 6.49% | 2.68% |
| Sova_RuDevices | 17.28% | 21.57% | 19.87% |
*Whisper-v3 = Whisper Large V3, shown for reference. Lower is better.*
## Limitations
- Optimized for audio clips up to 30 seconds (for longer recordings, see the chunking sketch below)
- Best performance on Russian and English
- May struggle with heavily noisy audio
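
For recordings longer than 30 seconds, a simple workaround is to transcribe fixed-size windows independently and join the results. A minimal sketch against the `generate`/`decode` API from Quick Start (no overlap handling or punctuation repair, so chunk boundaries may split words):

```python
import torch
import torchaudio

CHUNK_SECONDS = 30
SAMPLE_RATE = 16000

def transcribe_long(model, path):
    """Naive long-form transcription: split into 30 s windows,
    run each through the model, and concatenate the text."""
    audio, sr = torchaudio.load(path)
    audio = torchaudio.functional.resample(audio, sr, SAMPLE_RATE).squeeze()
    chunk = CHUNK_SECONDS * SAMPLE_RATE
    pieces = []
    for start in range(0, audio.numel(), chunk):
        with torch.inference_mode():
            ids = model.generate(
                audio=audio[start:start + chunk],
                user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
                system_prompt="You are a speech recognition assistant.",
                max_new_tokens=256,
            )
        pieces.append(model.decode(ids[0]))
    return " ".join(pieces)
```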
## Citation
```bibtex
@misc{borealis2025,
  title={Borealis: Audio-Language Model for Speech Understanding},
  author={VikhrModels},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Vikhrmodels/Borealis-5b-it}
}
```
## License
Apache 2.0