AlexWortega committed on
Commit 40cc687 · verified · 1 Parent(s): 83a9bad

Upload Borealis_Demo.ipynb with huggingface_hub

Files changed (1)
  1. Borealis_Demo.ipynb +415 -0
Borealis_Demo.ipynb ADDED
@@ -0,0 +1,415 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 🌌 Borealis-5B-IT\n",
+ "\n",
+ "## Audio-Language Model for Speech Understanding\n",
+ "\n",
+ "Borealis combines the **Whisper Large V3** encoder with the **Qwen3-4B** LLM to understand and respond to audio input.\n",
+ "\n",
+ "| Component | Model | Parameters |\n",
+ "|-----------|-------|------------|\n",
+ "| Audio Encoder | Whisper Large V3 | ~600M (frozen) |\n",
+ "| Language Model | Qwen3-4B | ~4B (fine-tuned) |\n",
+ "| Adapter | 2-layer MLP | ~13M |\n",
+ "| **Total** | | **~5B** |\n",
+ "\n",
+ "**Languages**: Russian, English\n",
+ "\n",
+ "---"
+ ]
+ },
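+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🔍 Adapter sketch (illustrative)\n",
+ "\n",
+ "The 2-layer MLP adapter is the glue between the frozen Whisper encoder and the LLM: it projects encoder frames into the LLM embedding space. The cell below is a minimal sketch of such an adapter, **not** the actual Borealis code; the layer sizes (1280 for Whisper Large V3, 2560 assumed for Qwen3-4B) and the layout are illustrative guesses, which is why its parameter count (~10M) does not exactly match the ~13M adapter above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch.nn as nn\n",
+ "\n",
+ "# Illustrative sketch only -- layer sizes are assumptions, not Borealis internals\n",
+ "WHISPER_DIM = 1280  # Whisper Large V3 encoder hidden size\n",
+ "LLM_DIM = 2560      # assumed hidden size for Qwen3-4B\n",
+ "\n",
+ "class AudioAdapter(nn.Module):\n",
+ "    \"\"\"Projects Whisper encoder frames into the LLM embedding space.\"\"\"\n",
+ "    def __init__(self, in_dim=WHISPER_DIM, out_dim=LLM_DIM):\n",
+ "        super().__init__()\n",
+ "        self.net = nn.Sequential(\n",
+ "            nn.Linear(in_dim, out_dim),\n",
+ "            nn.GELU(),\n",
+ "            nn.Linear(out_dim, out_dim),\n",
+ "        )\n",
+ "\n",
+ "    def forward(self, encoder_states):  # (batch, frames, in_dim)\n",
+ "        return self.net(encoder_states)  # (batch, frames, out_dim)\n",
+ "\n",
+ "adapter = AudioAdapter()\n",
+ "print(f\"Adapter parameters: {sum(p.numel() for p in adapter.parameters()) / 1e6:.1f}M\")"
+ ]
+ },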
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 📦 Installation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install dependencies (uncomment if needed)\n",
+ "# !pip install torch torchaudio transformers safetensors datasets soundfile"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🚀 Load Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "os.environ[\"HF_AUDIO_DECODER_BACKEND\"] = \"soundfile\"\n",
+ "\n",
+ "import torch\n",
+ "from transformers import AutoModel\n",
+ "\n",
+ "# Load model (requires ~20GB RAM for CPU, ~12GB VRAM for GPU)\n",
+ "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "print(f\"Using device: {DEVICE}\")\n",
+ "\n",
+ "model = AutoModel.from_pretrained(\n",
+ "    \"Vikhrmodels/Borealis-5b-it\",\n",
+ "    trust_remote_code=True,\n",
+ "    device=DEVICE\n",
+ ")\n",
+ "model.eval()\n",
+ "print(\"✅ Model loaded!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🎵 Load Audio\n",
+ "\n",
+ "You can load audio from:\n",
+ "- Local file\n",
+ "- URL\n",
+ "- HuggingFace dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torchaudio\n",
+ "from IPython.display import Audio, display\n",
+ "\n",
+ "# Option 1: Load from HuggingFace dataset\n",
+ "from datasets import load_dataset, Audio as DatasetAudio\n",
+ "\n",
+ "ds = load_dataset(\"Vikhrmodels/Speech-Instructions\", split=\"train\", streaming=True)\n",
+ "ds = ds.cast_column(\"audio\", DatasetAudio(sampling_rate=16000))\n",
+ "\n",
+ "# Get a sample\n",
+ "sample = next(iter(ds))\n",
+ "audio_array = torch.tensor(sample[\"audio\"][\"array\"]).float()\n",
+ "sr = sample[\"audio\"][\"sampling_rate\"]\n",
+ "\n",
+ "print(f\"📊 Audio shape: {audio_array.shape}\")\n",
+ "print(f\"📊 Sample rate: {sr} Hz\")\n",
+ "print(f\"📊 Duration: {len(audio_array) / sr:.2f} seconds\")\n",
+ "print(f\"\\n📝 Original question: {sample['question']}\")\n",
+ "print(f\"📝 Original answer: {sample['answer'][:300]}...\")\n",
+ "\n",
+ "# Play audio\n",
+ "display(Audio(audio_array.numpy(), rate=sr))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Option 2: Load from local file (uncomment to use)\n",
+ "# audio_array, sr = torchaudio.load(\"your_audio.wav\")\n",
+ "# if sr != 16000:\n",
+ "#     audio_array = torchaudio.functional.resample(audio_array, sr, 16000)\n",
+ "#     sr = 16000\n",
+ "# audio_array = audio_array.squeeze()  # Remove channel dim if mono"
+ ]
+ },
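+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The list above also mentions loading from a URL; the sketch below fills that gap. It assumes the URL (a placeholder here) points to a WAV/FLAC file that `soundfile` can decode and that the server allows a plain HTTP download."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Option 3: Load from a URL (uncomment to use; the URL below is a placeholder)\n",
+ "# import io, urllib.request\n",
+ "# import soundfile as sf\n",
+ "#\n",
+ "# url = \"https://example.com/sample.wav\"  # replace with a real audio URL\n",
+ "# with urllib.request.urlopen(url) as resp:\n",
+ "#     data, sr = sf.read(io.BytesIO(resp.read()), dtype=\"float32\")\n",
+ "# audio_array = torch.tensor(data)\n",
+ "# if audio_array.dim() > 1:\n",
+ "#     audio_array = audio_array.mean(dim=-1)  # downmix stereo to mono\n",
+ "# if sr != 16000:\n",
+ "#     audio_array = torchaudio.functional.resample(audio_array, sr, 16000)\n",
+ "#     sr = 16000"
+ ]
+ },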
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 💬 Generate Response\n",
+ "\n",
+ "### Basic Usage"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"What is being said in this audio? <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a helpful voice assistant.\",\n",
+ "        max_new_tokens=256,\n",
+ "        temperature=0.7,\n",
+ "        top_p=0.9,\n",
+ "    )\n",
+ "\n",
+ "response = model.decode(output[0])\n",
+ "print(\"🤖 Model response:\")\n",
+ "print(response)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## 📚 Prompt Examples\n",
+ "\n",
+ "### 🎯 Transcription"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Transcribe this audio accurately: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a speech recognition assistant. Transcribe audio to text accurately.\",\n",
+ "        max_new_tokens=512,\n",
+ "        temperature=0.3,  # Lower temperature for more accurate transcription\n",
+ "    )\n",
+ "\n",
+ "print(\"📝 Transcription:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 📋 Summarization"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Summarize the main points of this audio: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a helpful assistant. Provide concise summaries.\",\n",
+ "        max_new_tokens=256,\n",
+ "        temperature=0.7,\n",
+ "    )\n",
+ "\n",
+ "print(\"📋 Summary:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🇷🇺 Russian Prompts"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"О чём говорится в этой аудиозаписи? <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"Ты полезный голосовой ассистент. Отвечай на русском языке.\",\n",
+ "        max_new_tokens=256,\n",
+ "        temperature=0.7,\n",
+ "    )\n",
+ "\n",
+ "print(\"🇷🇺 Ответ на русском:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🎭 Audio Description"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Describe in detail what you hear, including tone, emotion, and content: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are an expert audio analyst. Provide detailed descriptions.\",\n",
+ "        max_new_tokens=512,\n",
+ "        temperature=0.8,\n",
+ "    )\n",
+ "\n",
+ "print(\"🎭 Detailed description:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## ⚙️ Generation Parameters\n",
+ "\n",
+ "| Parameter | Description | Recommended |\n",
+ "|-----------|-------------|-------------|\n",
+ "| `max_new_tokens` | Maximum tokens to generate | 128-512 |\n",
+ "| `temperature` | Randomness (0=deterministic, 1+=creative) | 0.3-0.8 |\n",
+ "| `top_p` | Nucleus sampling threshold | 0.9 |\n",
+ "| `do_sample` | Enable sampling (auto-set based on temperature) | True |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Experiment with different parameters\n",
+ "def generate_with_params(audio, prompt, temp=0.7, max_tokens=256):\n",
+ "    with torch.inference_mode():\n",
+ "        output = model.generate(\n",
+ "            audio=audio,\n",
+ "            user_prompt=f\"{prompt} <|start_of_audio|><|end_of_audio|>\",\n",
+ "            system_prompt=\"You are a helpful voice assistant.\",\n",
+ "            max_new_tokens=max_tokens,\n",
+ "            temperature=temp,\n",
+ "            top_p=0.9,\n",
+ "        )\n",
+ "    return model.decode(output[0])\n",
+ "\n",
+ "# Compare different temperatures\n",
+ "print(\"🌡️ Temperature = 0.3 (more focused):\")\n",
+ "print(generate_with_params(audio_array, \"What is this audio about?\", temp=0.3))\n",
+ "print(\"\\n\" + \"=\"*50 + \"\\n\")\n",
+ "print(\"🌡️ Temperature = 0.9 (more creative):\")\n",
+ "print(generate_with_params(audio_array, \"What is this audio about?\", temp=0.9))"
+ ]
+ },
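+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For repeatable output (for example, when comparing transcriptions across runs) you can push the model toward deterministic decoding. The cell below is a sketch that assumes the custom `generate` accepts an explicit `do_sample` flag, as the parameter table above suggests; if it does not, a very low `temperature` gives a similar effect."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Deterministic decoding sketch -- assumes `do_sample` can be passed explicitly\n",
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Transcribe this audio accurately: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a speech recognition assistant.\",\n",
+ "        max_new_tokens=512,\n",
+ "        temperature=0.1,  # keep low; some implementations dislike exactly 0.0\n",
+ "        do_sample=False,\n",
+ "    )\n",
+ "\n",
+ "print(\"🔁 Deterministic transcription:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },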
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## 🎤 Record Your Own Audio\n",
+ "\n",
+ "Use Gradio to record and test with your own voice:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import gradio as gr\n",
+ "\n",
+ "def process_audio(audio, system_prompt, user_prompt, max_tokens, temperature):\n",
+ "    if audio is None:\n",
+ "        return \"Please record or upload audio.\"\n",
+ "\n",
+ "    sr, audio_array = audio\n",
+ "    audio_tensor = torch.tensor(audio_array).float()\n",
+ "\n",
+ "    if audio_tensor.dim() > 1:\n",
+ "        audio_tensor = audio_tensor.mean(dim=-1)\n",
+ "    if audio_tensor.abs().max() > 1.0:\n",
+ "        audio_tensor = audio_tensor / 32768.0\n",
+ "    if sr != 16000:\n",
+ "        audio_tensor = torchaudio.functional.resample(audio_tensor, sr, 16000)\n",
+ "\n",
+ "    if \"<|start_of_audio|>\" not in user_prompt:\n",
+ "        user_prompt = f\"{user_prompt} <|start_of_audio|><|end_of_audio|>\"\n",
+ "\n",
+ "    with torch.inference_mode():\n",
+ "        output = model.generate(\n",
+ "            audio=audio_tensor,\n",
+ "            system_prompt=system_prompt,\n",
+ "            user_prompt=user_prompt,\n",
+ "            max_new_tokens=int(max_tokens),\n",
+ "            temperature=temperature,\n",
+ "        )\n",
+ "\n",
+ "    return model.decode(output[0])\n",
+ "\n",
+ "demo = gr.Interface(\n",
+ "    fn=process_audio,\n",
+ "    inputs=[\n",
+ "        gr.Audio(sources=[\"microphone\", \"upload\"], type=\"numpy\", label=\"Audio\"),\n",
+ "        gr.Textbox(value=\"You are a helpful voice assistant.\", label=\"System Prompt\"),\n",
+ "        gr.Textbox(value=\"What is being said? <|start_of_audio|><|end_of_audio|>\", label=\"User Prompt\"),\n",
+ "        gr.Slider(64, 512, value=256, step=64, label=\"Max Tokens\"),\n",
+ "        gr.Slider(0.1, 1.5, value=0.7, step=0.1, label=\"Temperature\"),\n",
+ "    ],\n",
+ "    outputs=gr.Textbox(label=\"Response\", lines=10),\n",
+ "    title=\"🌌 Borealis Audio Chat\",\n",
+ "    description=\"Record or upload audio and chat with Borealis!\",\n",
+ ")\n",
+ "\n",
+ "demo.launch(inline=True, height=600)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## 📊 Training Data\n",
+ "\n",
+ "Borealis was fine-tuned on:\n",
+ "\n",
+ "| Dataset | Description | Link |\n",
+ "|---------|-------------|------|\n",
+ "| Speech-Instructions | General speech instruction-following | [🔗](https://huggingface.co/datasets/Vikhrmodels/Speech-Instructions) |\n",
+ "| Speech-Describe | Audio description tasks | [🔗](https://huggingface.co/datasets/Vikhrmodels/Speech-Describe) |\n",
+ "| ToneBooks | Russian audiobook excerpts | [🔗](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) |\n",
+ "| AudioBooksInstructGemini2.5 | Gemini-generated instructions | [🔗](https://huggingface.co/datasets/Vikhrmodels/AudioBooksInstructGemini2.5) |\n",
+ "\n",
+ "---\n",
+ "\n",
+ "## 📎 Links\n",
+ "\n",
+ "- **Model**: [Vikhrmodels/Borealis-5b-it](https://huggingface.co/Vikhrmodels/Borealis-5b-it)\n",
+ "- **Demo Space**: [Vikhrmodels/Borealis-inference](https://huggingface.co/spaces/Vikhrmodels/Borealis-inference)\n",
+ "- **GitHub**: [VikhrModels/Borealis](https://github.com/VikhrModels/Borealis)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }