Vikhrmodels
/

Borealis-5b-it

@@ -15,17 +15,7 @@ library_name: transformers
 # Borealis-5B-IT
-## Benchmarks
-  | Split                    | WER    | CER    | Samples |
-  |--------------------------|--------|--------|---------|
-  | Russian_LibriSpeech      | 6.63%  | 3.49%  | 1000    |
-  | Common_Voice_Corpus_22.0 | 8.88%  | 5.04%  | 1000    |
-  | Tone_Webinars            | 56.87% | 52.47% | 1000    |
-  | Tone_Books               | 6.03%  | 3.75%  | 1000    |
-  | Tone_Speak               | 4.63%  | 3.38%  | 700     |
-  | Sova_RuDevices           | 17.28% | 8.03%  | 1000    |
 ## Model Description
@@ -161,6 +151,104 @@ Audio Input (16kHz)
     Text Output
 ```
 ## Limitations
 - Optimized for audio up to 30 seconds

 # Borealis-5B-IT
+Borealis is an audio-language model that combines Whisper encoder with Qwen3-4B LLM for speech understanding and instruction-following tasks.
 ## Model Description
     Text Output
 ```
+## vLLM Support
+Borealis can be accelerated using vLLM for the text generation backbone. Since Borealis uses custom audio processing (Whisper encoder + adapter), we provide a hybrid approach.
+### Install vLLM
+```bash
+pip install vllm>=0.6.0
+```
+### Option 1: Text-only with vLLM (Qwen3-4B backbone)
+If you've already processed audio to text (e.g., via ASR), you can use vLLM directly with the Qwen3 backbone:
+```python
+from vllm import LLM, SamplingParams
+llm = LLM(
+    model="Qwen/Qwen3-4B",
+    dtype="bfloat16",
+    gpu_memory_utilization=0.8,
+)
+prompt = """<|im_start|>system
+You are a helpful voice assistant.<|im_end|>
+<|im_start|>user
+[Transcribed text from audio goes here]<|im_end|>
+<|im_start|>assistant
+"""
+sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
+outputs = llm.generate([prompt], sampling_params)
+print(outputs[0].outputs[0].text)
+```
+### Option 2: Hybrid Inference (HF Audio + vLLM Text)
+For maximum performance, use HuggingFace for audio encoding and vLLM for text generation:
+```python
+import torch
+import torchaudio
+from transformers import AutoModel
+from vllm import LLM, SamplingParams
+# Step 1: Load Borealis for audio encoding
+borealis = AutoModel.from_pretrained(
+    "Vikhrmodels/Borealis-5b-it",
+    trust_remote_code=True,
+    device="cuda"
+)
+borealis.eval()
+# Step 2: Load vLLM for text generation
+vllm_model = LLM(
+    model="Qwen/Qwen3-4B",
+    dtype="bfloat16",
+    gpu_memory_utilization=0.5,
+)
+# Step 3: Encode audio with Borealis
+audio, sr = torchaudio.load("audio.wav")
+if sr != 16000:
+    audio = torchaudio.functional.resample(audio, sr, 16000)
+audio = audio.squeeze()
+with torch.inference_mode():
+    # Get audio transcription/understanding from Borealis
+    output_ids = borealis.generate(
+        audio=audio,
+        user_prompt="Transcribe: <|start_of_audio|><|end_of_audio|>",
+        system_prompt="You are a speech recognition assistant.",
+        max_new_tokens=128,
+    )
+    transcription = borealis.decode(output_ids[0])
+# Step 4: Use vLLM for fast follow-up generation
+prompt = f"""<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+Based on this audio transcription: "{transcription}"
+Please provide a detailed summary.<|im_end|>
+<|im_start|>assistant
+"""
+sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
+outputs = vllm_model.generate([prompt], sampling_params)
+print(outputs[0].outputs[0].text)
+```
+### Benchmark Results
+| Method | Throughput | Notes |
+|--------|------------|-------|
+| Native HF (Borealis) | 32.6 tok/s | Full audio-to-text pipeline |
+| vLLM (Qwen3-4B) | 201.4 tok/s | Text-only, 6.18x faster |
+| Hybrid | ~150 tok/s | Audio encoding + vLLM text gen |
 ## Limitations
 - Optimized for audio up to 30 seconds