---
license: apache-2.0
language:
- ru
- en
pipeline_tag: audio-text-to-text
tags:
- audio
- speech
- multimodal
- whisper
- qwen
library_name: transformers
datasets:
- Vikhrmodels/AudioBooksInstructGemini2.5
---

# Borealis-5B-IT

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/Vikhrmodels/Borealis-5b-it/blob/main/Borealis_Demo.ipynb)

Borealis is an audio-language model that combines a Whisper encoder with a Qwen3-4B LLM for speech understanding and instruction-following tasks.

## Model Description

- **Audio Encoder**: Whisper Large V3 (frozen)
- **Language Model**: Qwen3-4B (fine-tuned)
- **Adapter**: 2-layer MLP projecting audio embeddings into the LLM embedding space
- **Total Parameters**: ~5B
- **Languages**: Russian, English

## Installation

```bash
pip install transformers torch torchaudio safetensors
```

## Quick Start

```python
import torch
import torchaudio
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    device="cuda",
)
model.eval()

# Load audio and resample to 16 kHz
audio, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.squeeze()

# Generate response
with torch.inference_mode():
    output_ids = model.generate(
        audio=audio,
        user_prompt="What is being said in this audio? <|start_of_audio|><|end_of_audio|>",
        system_prompt="You are a helpful voice assistant.",
        max_new_tokens=256,
        temperature=0.7,
    )
response = model.decode(output_ids[0])
print(response)
```

## Prompt Examples

### Audio Transcription

```python
output = model.generate(
    audio=audio,
    user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a speech recognition assistant. Accurately transcribe audio to text.",
)
```

### Audio Summarization

```python
output = model.generate(
    audio=audio,
    user_prompt="Summarize what is said in this recording: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are a helpful voice assistant.",
)
```

### Audio Q&A (Russian)

```python
# Prompt translation: "What is this audio recording about?" /
# "You are a helpful voice assistant."
output = model.generate(
    audio=audio,
    user_prompt="О чём говорится в этой аудиозаписи? <|start_of_audio|><|end_of_audio|>",
    system_prompt="Ты полезный голосовой ассистент.",
)
```

### Content Description

```python
output = model.generate(
    audio=audio,
    user_prompt="Describe in detail what you hear: <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an attentive listener.",
)
```

### Emotion Analysis

```python
output = model.generate(
    audio=audio,
    user_prompt="What emotions does the speaker express? <|start_of_audio|><|end_of_audio|>",
    system_prompt="You are an expert in audio analysis.",
)
```
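### Batch Transcription (Sketch)

The prompt patterns above compose naturally into small utilities. Below is a minimal batch-transcription sketch that reuses only the calls shown in Quick Start (`torchaudio` loading, `model.generate`, `model.decode`); the `transcribe_file` helper, the `recordings/` folder, and the mono downmix are illustrative assumptions, not part of the model's API.

```python
from pathlib import Path

import torch
import torchaudio
from transformers import AutoModel

# Load the model once, as in Quick Start
model = AutoModel.from_pretrained(
    "Vikhrmodels/Borealis-5b-it", trust_remote_code=True, device="cuda"
)
model.eval()


def transcribe_file(path: str) -> str:
    """Load a file, resample to 16 kHz mono, and transcribe it (illustrative helper)."""
    audio, sr = torchaudio.load(path)
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    audio = audio.mean(dim=0)  # downmix multi-channel audio to mono (assumption)
    with torch.inference_mode():
        output_ids = model.generate(
            audio=audio,
            user_prompt="Transcribe this audio: <|start_of_audio|><|end_of_audio|>",
            system_prompt="You are a speech recognition assistant. Accurately transcribe audio to text.",
            max_new_tokens=256,
        )
    return model.decode(output_ids[0])


# Hypothetical input folder
for wav in sorted(Path("recordings").glob("*.wav")):
    print(wav.name, "->", transcribe_file(str(wav)))
```

Since the model is optimized for clips up to 30 seconds (see Limitations below), longer files should be split into chunks before transcription.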
## Training Data

The model was fine-tuned on a diverse mix of audio-instruction datasets:

| Dataset | Description | Size |
|---------|-------------|------|
| [Vikhrmodels/Speech-Instructions](https://huggingface.co/datasets/Vikhrmodels/Speech-Instructions) | General speech instruction-following | 70k |
| [Vikhrmodels/Speech-Describe](https://huggingface.co/datasets/Vikhrmodels/Speech-Describe) | Audio description tasks (speech & non-speech) | ~2M |
| [Vikhrmodels/ToneBooks](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) | Russian audiobook excerpts | - |
| [Vikhrmodels/AudioBooksInstructGemini2.5](https://huggingface.co/datasets/Vikhrmodels/AudioBooksInstructGemini2.5) | Instruction data generated with Gemini 2.5 | - |

## Model Architecture

```
Audio Input (16kHz)
        │
        ▼
┌─────────────────┐
│ Whisper Large V3│  (Frozen)
│     Encoder     │
└────────┬────────┘
         │ (1280-dim embeddings)
         ▼
┌─────────────────┐
│   Downsampler   │  (4x temporal reduction)
│    + Adapter    │
└────────┬────────┘
         │ (2560-dim embeddings)
         ▼
┌─────────────────┐
│    Qwen3-4B     │  (Fine-tuned)
│       LLM       │
└────────┬────────┘
         │
         ▼
    Text Output
```

## vLLM Support

Borealis has native vLLM support through a plugin system. This enables high-performance inference with full audio processing.

### Install vLLM Plugin

```bash
pip install "vllm>=0.12.0"

# Clone the plugin only (skip large model weights)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Vikhrmodels/Borealis-5b-it
cd Borealis-5b-it/vllm_borealis
pip install -e .
```

### Basic Usage

```python
import librosa
from vllm import LLM, SamplingParams

# Load model with vLLM
llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

# Load audio (16 kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Simple prompt with audio placeholder
prompt = "<|AUDIO|>Transcribe this audio."

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

### With Chat Format

```python
import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="Vikhrmodels/Borealis-5b-it",
    trust_remote_code=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"audio": 1},
)

audio, sr = librosa.load("audio.wav", sr=16000)

# Build prompt with the Qwen3 chat format
prompt = """<|im_start|>system
You are a helpful voice assistant.<|im_end|>
<|im_start|>user
<|AUDIO|>What is being said in this audio?<|im_end|>
<|im_start|>assistant
"""

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"audio": audio},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

### OpenAI-Compatible Server

> **Note**: Install the vLLM plugin first (see above).

```bash
# Start the vLLM server
vllm serve Vikhrmodels/Borealis-5b-it \
    --trust-remote-code \
    --dtype bfloat16 \
    --limit-mm-per-prompt audio=1
```

### How It Works

The vLLM plugin processes audio through the full Borealis pipeline:

```
Audio (numpy array, 16kHz)
    ↓ WhisperFeatureExtractor
Mel spectrogram [128, 3000]
    ↓ WhisperEncoder (frozen)
Encoder output [1500, 1280]
    ↓ Downsample 4x (concat adjacent frames)
[375, 5120]
    ↓ AudioLanguageAdapter (2-layer MLP)
Audio embeddings [375, 2560]
    ↓ Replace <|AUDIO|> tokens
    ↓ Qwen3-4B LLM (vLLM optimized)
Generated text
```

Each 30-second audio clip produces **375 audio tokens** in the sequence: the encoder's 1500 output frames divided by the 4x downsampling.
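To make the shapes above concrete, here is a minimal PyTorch sketch of the downsample-and-project stage. Only the 4x adjacent-frame concatenation and the 1280 → 5120 → 2560 shapes come from the pipeline description; the class name, the GELU activation, and the second layer's width are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class AudioLanguageAdapterSketch(nn.Module):
    """Illustrative downsampler + 2-layer MLP adapter (names/activation are assumptions)."""

    def __init__(self, enc_dim: int = 1280, stack: int = 4, llm_dim: int = 2560):
        super().__init__()
        self.stack = stack
        self.net = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),  # 5120 -> 2560
            nn.GELU(),                            # activation is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T, enc_dim] Whisper encoder output, e.g. [1500, 1280] for 30 s
        t = (x.shape[0] // self.stack) * self.stack  # drop any trailing remainder
        x = x[:t].reshape(t // self.stack, self.stack * x.shape[1])  # [375, 5120]
        return self.net(x)  # [375, 2560]


frames = torch.randn(1500, 1280)  # 30 s of encoder output
print(AudioLanguageAdapterSketch()(frames).shape)  # torch.Size([375, 2560])
```

Stacking four adjacent frames before a single projection is what turns the encoder's 1500 frames into the 375 audio tokens mentioned above.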
### Benchmark Results

Tested on an NVIDIA A100 with 30-second audio input and 128 max tokens:

| Method | Throughput | Speedup |
|--------|------------|---------|
| HuggingFace (native) | 44.9 tok/s | 1.0x |
| **vLLM (plugin)** | **95.9 tok/s** | **2.1x** |

With full audio processing support, the vLLM plugin delivers roughly a 2x throughput gain over native HuggingFace inference.

### ASR Benchmarks (WER / CER)

| Split | Borealis baseline | Borealis step-2898 | Whisper-v3 |
|---------------------|-------------------|--------------------|------------|
| Russian_LibriSpeech | 6.63% | 5.64% | 11.68% |
| Common_Voice | 8.88% | 12.67% | 12.23% |
| Tone_Webinars | 56.87% | 60.55% | 7.77% |
| Tone_Books | 6.03% | 5.25% | 11.95% |
| Tone_Speak | 4.63% | 6.49% | 2.68% |
| Sova_RuDevices | 17.28% | 21.57% | 19.87% |

*Whisper-v3 refers to Whisper Large V3. Lower is better.*

## Limitations

- Optimized for audio up to 30 seconds
- Performs best on Russian and English
- May struggle with heavily noisy audio

## Citation

```bibtex
@misc{borealis2025,
  title={Borealis: Audio-Language Model for Speech Understanding},
  author={VikhrModels},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Vikhrmodels/Borealis-5b-it}
}
```

## License

Apache 2.0