|
|
---
license: apache-2.0
language:
- ru
- en
pipeline_tag: audio-text-to-text
tags:
- audio
- speech
- multimodal
- whisper
- qwen
library_name: transformers
datasets:
- Vikhrmodels/AudioBooksInstructGemini2.5
---
|
|
|
|
|
# Borealis-5B-IT |
|
|
|
|
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/Vikhrmodels/Borealis-5b-it/blob/main/Borealis_Demo.ipynb)
|
|
|
|
|
Borealis is an audio-language model that combines a Whisper encoder with a Qwen3-4B LLM for speech understanding and instruction-following tasks.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Audio Encoder**: Whisper Large V3 (frozen)
- **Language Model**: Qwen3-4B (fine-tuned)
- **Adapter**: 2-layer MLP projecting audio embeddings to LLM space
- **Total Parameters**: ~5B
- **Languages**: Russian, English
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash
pip install transformers torch torchaudio safetensors
```
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
import torch
import torchaudio
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    device="cuda",
)
model.eval()

# Load audio and resample to 16 kHz if needed
audio, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.squeeze()

# Generate response
with torch.inference_mode():
    output_ids = model.generate(
        audio=audio,
        user_prompt="What is being said in this audio? <|start_of_audio|><|end_of_audio|>",
        system_prompt="You are a helpful voice assistant.",
        max_new_tokens=256,
        temperature=0.7,
    )

response = model.decode(output_ids[0])
print(response)
```
|
|
|
|
|
## Prompt Examples |
|
|
|
|
|
### Audio Transcription |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a speech recognition assistant. Accurately transcribe audio to text.",
)
```
|
|
|
|
|
### Audio Summarization |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="Summarize what is said in this recording: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a helpful voice assistant.",
)
```
|
|
|
|
|
### Audio Q&A (Russian) |
|
|
```python
output = model.generate(
    audio=audio,
    # "What is this audio recording about?"
    user_prompt="О чём говорится в этой аудиозаписи? <|start_of_audio|><|end_of_audio|>",
    # "You are a helpful voice assistant."
    system_prompt="Ты полезный голосовой ассистент.",
)
```
|
|
|
|
|
### Content Description |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="Describe in detail what you hear: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an attentive listener.",
)
```
|
|
|
|
|
### Emotion Analysis |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="What emotions does the speaker express? <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an expert in audio analysis.",
)
```
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was fine-tuned on a diverse mix of audio-instruction datasets: |
|
|
|
|
|
| Dataset | Description | Size |
|---------|-------------|------|
| [Vikhrmodels/Speech-Instructions](https://huggingface.co/datasets/Vikhrmodels/Speech-Instructions) | General speech instruction-following | 70k |
| [Vikhrmodels/Speech-Describe](https://huggingface.co/datasets/Vikhrmodels/Speech-Describe) | Audio description tasks (speech & non-speech) | ~2M |
| [Vikhrmodels/ToneBooks](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) | Russian audiobook excerpts | - |
| [Vikhrmodels/AudioBooksInstructGemini2.5](https://huggingface.co/datasets/Vikhrmodels/AudioBooksInstructGemini2.5) | Instruction data generated with Gemini 2.5 | - |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
```
Audio Input (16kHz)
         │
         ▼
┌─────────────────┐
│ Whisper Large V3│ (Frozen)
│     Encoder     │
└────────┬────────┘
         │ (1280-dim embeddings)
         ▼
┌─────────────────┐
│   Downsampler   │ (4x temporal reduction)
│    + Adapter    │
└────────┬────────┘
         │ (2560-dim embeddings)
         ▼
┌─────────────────┐
│    Qwen3-4B     │ (Fine-tuned)
│       LLM       │
└────────┬────────┘
         │
         ▼
    Text Output
```
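The downsampler and adapter stage is small enough to sketch. Below is a minimal, illustrative PyTorch reconstruction based only on the shapes above (1280-dim Whisper features, 4x frame concatenation, a 2-layer MLP into the 2560-dim LLM space); the class name, layer layout, and activation are assumptions, not the checkpoint's actual modules.

```python
import torch
import torch.nn as nn

class AudioAdapterSketch(nn.Module):
    """Illustrative downsampler + 2-layer MLP adapter (names/activation assumed)."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2560, stack: int = 4):
        super().__init__()
        self.stack = stack
        # Concatenating 4 adjacent frames yields 4 * 1280 = 5120 input features.
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T, 1280] Whisper encoder output, e.g. T = 1500 for 30 s of audio.
        t = (x.shape[0] // self.stack) * self.stack
        x = x[:t].reshape(t // self.stack, -1)  # [T/4, 5120] -- 4x temporal reduction
        return self.mlp(x)                      # [T/4, 2560] -- LLM embedding space

frames = torch.randn(1500, 1280)
print(AudioAdapterSketch()(frames).shape)  # torch.Size([375, 2560])
```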
|
|
|
|
|
## vLLM Support |
|
|
|
|
|
Borealis has native vLLM support through a plugin system. This enables high-performance inference with full audio processing. |
|
|
|
|
|
### Install vLLM Plugin |
|
|
|
|
|
```bash
pip install "vllm>=0.12.0"

# Clone the plugin only (GIT_LFS_SKIP_SMUDGE skips the large model weights)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Vikhrmodels/Borealis-5b-it
cd Borealis-5b-it/vllm_borealis
pip install -e .
```
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
import librosa
from vllm import LLM, SamplingParams

# Load model with vLLM
llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Simple prompt with audio placeholder
prompt = "<|AUDIO|>Transcribe this audio."

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)

print(outputs[0].outputs[0].text)
```
|
|
|
|
|
### With Chat Format |
|
|
|
|
|
```python
import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

audio, sr = librosa.load("audio.wav", sr=16000)

# Build prompt with Qwen3 chat format
prompt = """<|im_start|>system
You are a helpful voice assistant.<|im_end|>
<|im_start|>user
<|AUDIO|>What is being said in this audio?<|im_end|>
<|im_start|>assistant
"""

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)

print(outputs[0].outputs[0].text)
```
|
|
|
|
|
### OpenAI-Compatible Server |
|
|
|
|
|
> **Note**: Install the vLLM plugin first (see above). |
|
|
|
|
|
```bash
# Start vLLM server
vllm serve Vikhrmodels/Borealis-5b-it \
    --trust-remote-code \
    --dtype bfloat16 \
    --limit-mm-per-prompt audio=1
```
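Once the server is running, it can be queried with any OpenAI-compatible client. The sketch below uses the `openai` Python package and vLLM's `audio_url` multimodal content part with a base64 data URL; the exact content schema the plugin accepts is an assumption here, so consult the plugin if the request is rejected.

```python
import base64
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send the audio inline as a base64 data URL.
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Vikhrmodels/Borealis-5b-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```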
|
|
|
|
|
### How It Works |
|
|
|
|
|
The vLLM plugin processes audio through the full Borealis pipeline: |
|
|
|
|
|
```
Audio (numpy array, 16kHz)
    ↓ WhisperFeatureExtractor
Mel spectrogram [128, 3000]
    ↓ WhisperEncoder (frozen)
Encoder output [1500, 1280]
    ↓ Downsample 4x (concat adjacent frames)
[375, 5120]
    ↓ AudioLanguageAdapter (2-layer MLP)
Audio embeddings [375, 2560]
    ↓ Replace <|AUDIO|> tokens
    ↓ Qwen3-4B LLM (vLLM optimized)
Generated text
```
|
|
|
|
|
Each 30-second audio clip produces **375 audio tokens** in the sequence. |
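The 375-token figure follows directly from the frame counts in the pipeline above; as a quick sanity check:

```python
mel_frames = 30 * 100               # WhisperFeatureExtractor: 100 mel frames per second
encoder_frames = mel_frames // 2    # Whisper's conv stem halves the time axis -> 1500
audio_tokens = encoder_frames // 4  # 4x downsampling in the adapter -> 375
assert audio_tokens == 375
```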
|
|
|
|
|
### Benchmark Results |
|
|
|
|
|
Tested on NVIDIA A100 with 30s audio input, 128 max tokens: |
|
|
|
|
|
| Method | Throughput | Speedup |
|--------|------------|---------|
| HuggingFace (native) | 44.9 tok/s | 1.0x |
| **vLLM (plugin)** | **95.9 tok/s** | **2.1x** |
|
|
|
|
|
vLLM provides ~2x speedup over HuggingFace with full audio processing support. |
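The exact harness behind these numbers is not published; a rough way to approximate the throughput measurement, reusing `llm`, `prompt`, and `audio` from the Basic Usage example above:

```python
import time
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"audio": audio}},
    sampling_params=sampling_params,
)
elapsed = time.perf_counter() - start

# Generated tokens per second for this single request.
generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated / elapsed:.1f} tok/s")
```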
|
|
|
|
|
### ASR Benchmarks (WER / CER) |
|
|
|
|
|
| Split | Borealis baseline | Borealis step-2898 | Whisper-v3 |
|---------------------|-------------------|--------------------|------------|
| Russian_LibriSpeech | 6.63% | 5.64% | 11.68% |
| Common_Voice | 8.88% | 12.67% | 12.23% |
| Tone_Webinars | 56.87% | 60.55% | 7.77% |
| Tone_Books | 6.03% | 5.25% | 11.95% |
| Tone_Speak | 4.63% | 6.49% | 2.68% |
| Sova_RuDevices | 17.28% | 21.57% | 19.87% |

*Baseline: Whisper Large V3. Lower is better.*
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Optimized for audio up to 30 seconds; longer recordings can be split into chunks (see the sketch below)
- Best performance on Russian and English
- May not handle heavily noisy audio well
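A minimal chunked-inference sketch for longer recordings, reusing the `generate`/`decode` API from Quick Start. This is a workaround, not a feature of the model: hard chunk boundaries can split words, so overlapping windows or VAD-based segmentation would be more robust.

```python
import torch
import torchaudio

def transcribe_long(model, path, chunk_seconds=30, sr=16000):
    """Naive chunked transcription for audio longer than 30 s (illustrative only)."""
    audio, in_sr = torchaudio.load(path)
    if in_sr != sr:
        audio = torchaudio.functional.resample(audio, in_sr, sr)
    audio = audio.squeeze()

    pieces = []
    for start in range(0, audio.shape[-1], chunk_seconds * sr):
        chunk = audio[start:start + chunk_seconds * sr]
        with torch.inference_mode():
            out = model.generate(
                audio=chunk,
                user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
                system_prompt="You are a speech recognition assistant.",
                max_new_tokens=256,
            )
        pieces.append(model.decode(out[0]))
    return " ".join(pieces)
```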
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{borealis2025,
  title={Borealis: Audio-Language Model for Speech Understanding},
  author={VikhrModels},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Vikhrmodels/Borealis-5b-it}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |