|
|
---
license: apache-2.0
language:
- ru
- en
pipeline_tag: audio-text-to-text
tags:
- audio
- speech
- multimodal
- whisper
- qwen
library_name: transformers
datasets:
- Vikhrmodels/AudioBooksInstructGemini2.5
---
|
|
|
|
|
# Borealis-5B-IT |
|
|
|
|
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/Vikhrmodels/Borealis-5b-it/blob/main/Borealis_Demo.ipynb)
|
|
|
|
|
Borealis is an audio-language model that combines a Whisper encoder with a Qwen3-4B LLM for speech understanding and instruction-following tasks.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Audio Encoder**: Whisper Large V3 (frozen)
- **Language Model**: Qwen3-4B (fine-tuned)
- **Adapter**: 2-layer MLP projecting audio embeddings to LLM space
- **Total Parameters**: ~5B
- **Languages**: Russian, English
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash
pip install transformers torch torchaudio safetensors
```
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
import torch
import torchaudio
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    device="cuda",
)
model.eval()

# Load audio and resample to 16 kHz if needed
audio, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.squeeze()

# Generate response
with torch.inference_mode():
    output_ids = model.generate(
        audio=audio,
        user_prompt="What is being said in this audio? <|start_of_audio|><|end_of_audio|>",
        system_prompt="You are a helpful voice assistant.",
        max_new_tokens=256,
        temperature=0.7,
    )

response = model.decode(output_ids[0])
print(response)
```
|
|
|
|
|
## Prompt Examples |
|
|
|
|
|
### Audio Transcription |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a speech recognition assistant. Accurately transcribe audio to text.",
)
```
|
|
|
|
|
### Audio Summarization |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="Summarize what is said in this recording: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a helpful voice assistant.",
)
```
|
|
|
|
|
### Audio Q&A (Russian) |
|
|
```python
output = model.generate(
    audio=audio,
    # "What is this audio recording about?"
    user_prompt="О чём говорится в этой аудиозаписи? <|start_of_audio|><|end_of_audio|>",
    # "You are a helpful voice assistant."
    system_prompt="Ты полезный голосовой ассистент.",
)
```
|
|
|
|
|
### Content Description |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="Describe in detail what you hear: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an attentive listener.",
)
```
|
|
|
|
|
### Emotion Analysis |
|
|
```python
output = model.generate(
    audio=audio,
    user_prompt="What emotions does the speaker express? <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an expert in audio analysis.",
)
```
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was fine-tuned on a diverse mix of audio-instruction datasets: |
|
|
|
|
|
| Dataset | Description | Size |
|---------|-------------|------|
| [Vikhrmodels/Speech-Instructions](https://huggingface.co/datasets/Vikhrmodels/Speech-Instructions) | General speech instruction-following | 70k |
| [Vikhrmodels/Speech-Describe](https://huggingface.co/datasets/Vikhrmodels/Speech-Describe) | Audio description tasks (speech & non-speech) | ~2M |
| [Vikhrmodels/ToneBooks](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) | Russian audiobook excerpts | - |
| [Vikhrmodels/AudioBooksInstructGemini2.5](https://huggingface.co/datasets/Vikhrmodels/AudioBooksInstructGemini2.5) | Instruction data generated with Gemini 2.5 | - |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
```
Audio Input (16kHz)
         │
         ▼
┌─────────────────┐
│ Whisper Large V3│ (Frozen)
│     Encoder     │
└────────┬────────┘
         │ (1280-dim embeddings)
         ▼
┌─────────────────┐
│   Downsampler   │ (4x temporal reduction)
│    + Adapter    │
└────────┬────────┘
         │ (2560-dim embeddings)
         ▼
┌─────────────────┐
│    Qwen3-4B     │ (Fine-tuned)
│       LLM       │
└────────┬────────┘
         │
         ▼
    Text Output
```
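The downsampler and adapter stage is small enough to sketch. Below is a minimal, illustrative PyTorch reconstruction based only on the shapes above (1280-dim Whisper features, 4x frame concatenation, a 2-layer MLP into the 2560-dim LLM space); the class name, layer layout, and activation are assumptions, not the checkpoint's actual modules.

```python
import torch
import torch.nn as nn

class AudioAdapterSketch(nn.Module):
    """Illustrative downsampler + 2-layer MLP adapter (names/activation assumed)."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2560, stack: int = 4):
        super().__init__()
        self.stack = stack
        # Concatenating 4 adjacent frames yields 4 * 1280 = 5120 input features.
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T, 1280] Whisper encoder output, e.g. T = 1500 for 30 s of audio.
        t = (x.shape[0] // self.stack) * self.stack
        x = x[:t].reshape(t // self.stack, -1)  # [T/4, 5120] -- 4x temporal reduction
        return self.mlp(x)                      # [T/4, 2560] -- LLM embedding space

frames = torch.randn(1500, 1280)
print(AudioAdapterSketch()(frames).shape)  # torch.Size([375, 2560])
```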
|
|
|
|
|
## vLLM Support |
|
|
|
|
|
Borealis has native vLLM support through a plugin system. This enables high-performance inference with full audio processing. |
|
|
|
|
|
### Install vLLM Plugin |
|
|
|
|
|
```bash
pip install "vllm>=0.12.0"

# Clone the plugin only (GIT_LFS_SKIP_SMUDGE skips the large model weights)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Vikhrmodels/Borealis-5b-it
cd Borealis-5b-it/vllm_borealis
pip install -e .
```
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
import librosa
from vllm import LLM, SamplingParams

# Load model with vLLM
llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Simple prompt with audio placeholder
prompt = "<|AUDIO|>Transcribe this audio."

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)

print(outputs[0].outputs[0].text)
```
|
|
|
|
|
### With Chat Format |
|
|
|
|
|
```python
import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

audio, sr = librosa.load("audio.wav", sr=16000)

# Build prompt with Qwen3 chat format
prompt = """<|im_start|>system
You are a helpful voice assistant.<|im_end|>
<|im_start|>user
<|AUDIO|>What is being said in this audio?<|im_end|>
<|im_start|>assistant
"""

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)

print(outputs[0].outputs[0].text)
```
|
|
|
|
|
### OpenAI-Compatible Server |
|
|
|
|
|
> **Note**: Install the vLLM plugin first (see above). |
|
|
|
|
|
```bash
# Start vLLM server
vllm serve Vikhrmodels/Borealis-5b-it \
    --trust-remote-code \
    --dtype bfloat16 \
    --limit-mm-per-prompt audio=1
```
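Once the server is running, it can be queried with any OpenAI-compatible client. The sketch below uses the `openai` Python package and vLLM's `audio_url` multimodal content part with a base64 data URL; the exact content schema the plugin accepts is an assumption here, so consult the plugin if the request is rejected.

```python
import base64
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send the audio inline as a base64 data URL.
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Vikhrmodels/Borealis-5b-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```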
|
|
|
|
|
### How It Works |
|
|
|
|
|
The vLLM plugin processes audio through the full Borealis pipeline: |
|
|
|
|
|
```
Audio (numpy array, 16kHz)
    ↓ WhisperFeatureExtractor
Mel spectrogram [128, 3000]
    ↓ WhisperEncoder (frozen)
Encoder output [1500, 1280]
    ↓ Downsample 4x (concat adjacent frames)
[375, 5120]
    ↓ AudioLanguageAdapter (2-layer MLP)
Audio embeddings [375, 2560]
    ↓ Replace <|AUDIO|> tokens
    ↓ Qwen3-4B LLM (vLLM optimized)
Generated text
```
|
|
|
|
|
Each 30-second audio clip produces **375 audio tokens** in the sequence. |
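The 375-token figure follows directly from the frame counts in the pipeline above; as a quick sanity check:

```python
mel_frames = 30 * 100               # WhisperFeatureExtractor: 100 mel frames per second
encoder_frames = mel_frames // 2    # Whisper's conv stem halves the time axis -> 1500
audio_tokens = encoder_frames // 4  # 4x downsampling in the adapter -> 375
assert audio_tokens == 375
```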
|
|
|
|
|
### Benchmark Results |
|
|
|
|
|
Tested on NVIDIA A100 with 30s audio input, 128 max tokens: |
|
|
|
|
|
| Method | Throughput | Speedup |
|--------|------------|---------|
| HuggingFace (native) | 44.9 tok/s | 1.0x |
| **vLLM (plugin)** | **95.9 tok/s** | **2.1x** |
|
|
|
|
|
vLLM provides ~2x speedup over HuggingFace with full audio processing support. |
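The exact harness behind these numbers is not published; a rough way to approximate the throughput measurement, reusing `llm`, `prompt`, and `audio` from the Basic Usage example above:

```python
import time
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"audio": audio}},
    sampling_params=sampling_params,
)
elapsed = time.perf_counter() - start

# Generated tokens per second for this single request.
generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated / elapsed:.1f} tok/s")
```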
|
|
|
|
|
### ASR Benchmarks (WER / CER) |
|
|
|
|
|
| Split | Borealis baseline | Borealis step-2898 | Whisper-v3 |
|---------------------|-------------------|--------------------|------------|
| Russian_LibriSpeech | 6.63% | 5.64% | 11.68% |
| Common_Voice | 8.88% | 12.67% | 12.23% |
| Tone_Webinars | 56.87% | 60.55% | 7.77% |
| Tone_Books | 6.03% | 5.25% | 11.95% |
| Tone_Speak | 4.63% | 6.49% | 2.68% |
| Sova_RuDevices | 17.28% | 21.57% | 19.87% |

*Baseline: Whisper Large V3. Lower is better.*
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Optimized for audio up to 30 seconds; longer recordings can be split into chunks (see the sketch below)
- Best performance on Russian and English
- May not handle heavily noisy audio well
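A minimal chunked-inference sketch for longer recordings, reusing the `generate`/`decode` API from Quick Start. This is a workaround, not a feature of the model: hard chunk boundaries can split words, so overlapping windows or VAD-based segmentation would be more robust.

```python
import torch
import torchaudio

def transcribe_long(model, path, chunk_seconds=30, sr=16000):
    """Naive chunked transcription for audio longer than 30 s (illustrative only)."""
    audio, in_sr = torchaudio.load(path)
    if in_sr != sr:
        audio = torchaudio.functional.resample(audio, in_sr, sr)
    audio = audio.squeeze()

    pieces = []
    for start in range(0, audio.shape[-1], chunk_seconds * sr):
        chunk = audio[start:start + chunk_seconds * sr]
        with torch.inference_mode():
            out = model.generate(
                audio=chunk,
                user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
                system_prompt="You are a speech recognition assistant.",
                max_new_tokens=256,
            )
        pieces.append(model.decode(out[0]))
    return " ".join(pieces)
```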
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{borealis2025,
  title={Borealis: Audio-Language Model for Speech Understanding},
  author={VikhrModels},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Vikhrmodels/Borealis-5b-it}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |