AlexWortega committed on
Commit 40cc687 · verified · 1 Parent(s): 83a9bad

Upload Borealis_Demo.ipynb with huggingface_hub

Files changed (1)
  1. Borealis_Demo.ipynb +415 -0
Borealis_Demo.ipynb ADDED
@@ -0,0 +1,415 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 🌌 Borealis-5B-IT\n",
+ "\n",
+ "## Audio-Language Model for Speech Understanding\n",
+ "\n",
+ "Borealis combines the **Whisper Large V3** encoder with the **Qwen3-4B** LLM to understand and respond to audio input.\n",
+ "\n",
+ "| Component | Model | Parameters |\n",
+ "|-----------|-------|------------|\n",
+ "| Audio Encoder | Whisper Large V3 | ~600M (frozen) |\n",
+ "| Language Model | Qwen3-4B | ~4B (fine-tuned) |\n",
+ "| Adapter | 2-layer MLP | ~13M |\n",
+ "| **Total** | | **~5B** |\n",
+ "\n",
+ "**Languages**: Russian, English\n",
+ "\n",
+ "---"
+ ]
+ },
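+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🔍 Adapter sketch (illustrative)\n",
+ "\n",
+ "The 2-layer MLP adapter is the glue between the frozen Whisper encoder and the LLM: it projects encoder frames into the LLM embedding space. The cell below is a minimal sketch of such an adapter, **not** the actual Borealis code; the layer sizes (1280 for Whisper Large V3, 2560 assumed for Qwen3-4B) and the layout are illustrative guesses, which is why its parameter count (~10M) does not exactly match the ~13M adapter above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch.nn as nn\n",
+ "\n",
+ "# Illustrative sketch only -- layer sizes are assumptions, not Borealis internals\n",
+ "WHISPER_DIM = 1280  # Whisper Large V3 encoder hidden size\n",
+ "LLM_DIM = 2560      # assumed hidden size for Qwen3-4B\n",
+ "\n",
+ "class AudioAdapter(nn.Module):\n",
+ "    \"\"\"Projects Whisper encoder frames into the LLM embedding space.\"\"\"\n",
+ "    def __init__(self, in_dim=WHISPER_DIM, out_dim=LLM_DIM):\n",
+ "        super().__init__()\n",
+ "        self.net = nn.Sequential(\n",
+ "            nn.Linear(in_dim, out_dim),\n",
+ "            nn.GELU(),\n",
+ "            nn.Linear(out_dim, out_dim),\n",
+ "        )\n",
+ "\n",
+ "    def forward(self, encoder_states):  # (batch, frames, in_dim)\n",
+ "        return self.net(encoder_states)  # (batch, frames, out_dim)\n",
+ "\n",
+ "adapter = AudioAdapter()\n",
+ "print(f\"Adapter parameters: {sum(p.numel() for p in adapter.parameters()) / 1e6:.1f}M\")"
+ ]
+ },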
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 📦 Installation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install dependencies (uncomment if needed)\n",
+ "# !pip install torch torchaudio transformers safetensors datasets soundfile"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🚀 Load Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "os.environ[\"HF_AUDIO_DECODER_BACKEND\"] = \"soundfile\"\n",
+ "\n",
+ "import torch\n",
+ "from transformers import AutoModel\n",
+ "\n",
+ "# Load model (requires ~20GB RAM for CPU, ~12GB VRAM for GPU)\n",
+ "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "print(f\"Using device: {DEVICE}\")\n",
+ "\n",
+ "model = AutoModel.from_pretrained(\n",
+ "    \"Vikhrmodels/Borealis-5b-it\",\n",
+ "    trust_remote_code=True,\n",
+ "    device=DEVICE\n",
+ ")\n",
+ "model.eval()\n",
+ "print(\"✅ Model loaded!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🎵 Load Audio\n",
+ "\n",
+ "You can load audio from:\n",
+ "- Local file\n",
+ "- URL\n",
+ "- HuggingFace dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torchaudio\n",
+ "from IPython.display import Audio, display\n",
+ "\n",
+ "# Option 1: Load from HuggingFace dataset\n",
+ "from datasets import load_dataset, Audio as DatasetAudio\n",
+ "\n",
+ "ds = load_dataset(\"Vikhrmodels/Speech-Instructions\", split=\"train\", streaming=True)\n",
+ "ds = ds.cast_column(\"audio\", DatasetAudio(sampling_rate=16000))\n",
+ "\n",
+ "# Get a sample\n",
+ "sample = next(iter(ds))\n",
+ "audio_array = torch.tensor(sample[\"audio\"][\"array\"]).float()\n",
+ "sr = sample[\"audio\"][\"sampling_rate\"]\n",
+ "\n",
+ "print(f\"📊 Audio shape: {audio_array.shape}\")\n",
+ "print(f\"📊 Sample rate: {sr} Hz\")\n",
+ "print(f\"📊 Duration: {len(audio_array) / sr:.2f} seconds\")\n",
+ "print(f\"\\n📝 Original question: {sample['question']}\")\n",
+ "print(f\"📝 Original answer: {sample['answer'][:300]}...\")\n",
+ "\n",
+ "# Play audio\n",
+ "display(Audio(audio_array.numpy(), rate=sr))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Option 2: Load from local file (uncomment to use)\n",
+ "# audio_array, sr = torchaudio.load(\"your_audio.wav\")\n",
+ "# if sr != 16000:\n",
+ "#     audio_array = torchaudio.functional.resample(audio_array, sr, 16000)\n",
+ "#     sr = 16000\n",
+ "# audio_array = audio_array.squeeze()  # Remove channel dim if mono"
+ ]
+ },
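+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The list above also mentions loading from a URL; the sketch below fills that gap. It assumes the URL (a placeholder here) points to a WAV/FLAC file that `soundfile` can decode and that the server allows a plain HTTP download."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Option 3: Load from a URL (uncomment to use; the URL below is a placeholder)\n",
+ "# import io, urllib.request\n",
+ "# import soundfile as sf\n",
+ "#\n",
+ "# url = \"https://example.com/sample.wav\"  # replace with a real audio URL\n",
+ "# with urllib.request.urlopen(url) as resp:\n",
+ "#     data, sr = sf.read(io.BytesIO(resp.read()), dtype=\"float32\")\n",
+ "# audio_array = torch.tensor(data)\n",
+ "# if audio_array.dim() > 1:\n",
+ "#     audio_array = audio_array.mean(dim=-1)  # downmix stereo to mono\n",
+ "# if sr != 16000:\n",
+ "#     audio_array = torchaudio.functional.resample(audio_array, sr, 16000)\n",
+ "#     sr = 16000"
+ ]
+ },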
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 💬 Generate Response\n",
+ "\n",
+ "### Basic Usage"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"What is being said in this audio? <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a helpful voice assistant.\",\n",
+ "        max_new_tokens=256,\n",
+ "        temperature=0.7,\n",
+ "        top_p=0.9,\n",
+ "    )\n",
+ "\n",
+ "response = model.decode(output[0])\n",
+ "print(\"🤖 Model response:\")\n",
+ "print(response)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## 📚 Prompt Examples\n",
+ "\n",
+ "### 🎯 Transcription"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Transcribe this audio accurately: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a speech recognition assistant. Transcribe audio to text accurately.\",\n",
+ "        max_new_tokens=512,\n",
+ "        temperature=0.3,  # Lower temperature for more accurate transcription\n",
+ "    )\n",
+ "\n",
+ "print(\"📝 Transcription:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 📋 Summarization"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Summarize the main points of this audio: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a helpful assistant. Provide concise summaries.\",\n",
+ "        max_new_tokens=256,\n",
+ "        temperature=0.7,\n",
+ "    )\n",
+ "\n",
+ "print(\"📋 Summary:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🇷🇺 Russian Prompts"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"О чём говорится в этой аудиозаписи? <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"Ты полезный голосовой ассистент. Отвечай на русском языке.\",\n",
+ "        max_new_tokens=256,\n",
+ "        temperature=0.7,\n",
+ "    )\n",
+ "\n",
+ "print(\"🇷🇺 Ответ на русском:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🎭 Audio Description"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Describe in detail what you hear, including tone, emotion, and content: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are an expert audio analyst. Provide detailed descriptions.\",\n",
+ "        max_new_tokens=512,\n",
+ "        temperature=0.8,\n",
+ "    )\n",
+ "\n",
+ "print(\"🎭 Detailed description:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## ⚙️ Generation Parameters\n",
+ "\n",
+ "| Parameter | Description | Recommended |\n",
+ "|-----------|-------------|-------------|\n",
+ "| `max_new_tokens` | Maximum tokens to generate | 128-512 |\n",
+ "| `temperature` | Randomness (0=deterministic, 1+=creative) | 0.3-0.8 |\n",
+ "| `top_p` | Nucleus sampling threshold | 0.9 |\n",
+ "| `do_sample` | Enable sampling (auto-set based on temperature) | True |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Experiment with different parameters\n",
+ "def generate_with_params(audio, prompt, temp=0.7, max_tokens=256):\n",
+ "    with torch.inference_mode():\n",
+ "        output = model.generate(\n",
+ "            audio=audio,\n",
+ "            user_prompt=f\"{prompt} <|start_of_audio|><|end_of_audio|>\",\n",
+ "            system_prompt=\"You are a helpful voice assistant.\",\n",
+ "            max_new_tokens=max_tokens,\n",
+ "            temperature=temp,\n",
+ "            top_p=0.9,\n",
+ "        )\n",
+ "    return model.decode(output[0])\n",
+ "\n",
+ "# Compare different temperatures\n",
+ "print(\"🌡️ Temperature = 0.3 (more focused):\")\n",
+ "print(generate_with_params(audio_array, \"What is this audio about?\", temp=0.3))\n",
+ "print(\"\\n\" + \"=\"*50 + \"\\n\")\n",
+ "print(\"🌡️ Temperature = 0.9 (more creative):\")\n",
+ "print(generate_with_params(audio_array, \"What is this audio about?\", temp=0.9))"
+ ]
+ },
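+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For repeatable output (for example, when comparing transcriptions across runs) you can push the model toward deterministic decoding. The cell below is a sketch that assumes the custom `generate` accepts an explicit `do_sample` flag, as the parameter table above suggests; if it does not, a very low `temperature` gives a similar effect."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Deterministic decoding sketch -- assumes `do_sample` can be passed explicitly\n",
+ "with torch.inference_mode():\n",
+ "    output = model.generate(\n",
+ "        audio=audio_array,\n",
+ "        user_prompt=\"Transcribe this audio accurately: <|start_of_audio|><|end_of_audio|>\",\n",
+ "        system_prompt=\"You are a speech recognition assistant.\",\n",
+ "        max_new_tokens=512,\n",
+ "        temperature=0.1,  # keep low; some implementations dislike exactly 0.0\n",
+ "        do_sample=False,\n",
+ "    )\n",
+ "\n",
+ "print(\"🔁 Deterministic transcription:\")\n",
+ "print(model.decode(output[0]))"
+ ]
+ },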
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## 🎤 Record Your Own Audio\n",
+ "\n",
+ "Use Gradio to record and test with your own voice:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import gradio as gr\n",
+ "\n",
+ "def process_audio(audio, system_prompt, user_prompt, max_tokens, temperature):\n",
+ "    if audio is None:\n",
+ "        return \"Please record or upload audio.\"\n",
+ "\n",
+ "    sr, audio_array = audio\n",
+ "    audio_tensor = torch.tensor(audio_array).float()\n",
+ "\n",
+ "    if audio_tensor.dim() > 1:\n",
+ "        audio_tensor = audio_tensor.mean(dim=-1)\n",
+ "    if audio_tensor.abs().max() > 1.0:\n",
+ "        audio_tensor = audio_tensor / 32768.0\n",
+ "    if sr != 16000:\n",
+ "        audio_tensor = torchaudio.functional.resample(audio_tensor, sr, 16000)\n",
+ "\n",
+ "    if \"<|start_of_audio|>\" not in user_prompt:\n",
+ "        user_prompt = f\"{user_prompt} <|start_of_audio|><|end_of_audio|>\"\n",
+ "\n",
+ "    with torch.inference_mode():\n",
+ "        output = model.generate(\n",
+ "            audio=audio_tensor,\n",
+ "            system_prompt=system_prompt,\n",
+ "            user_prompt=user_prompt,\n",
+ "            max_new_tokens=int(max_tokens),\n",
+ "            temperature=temperature,\n",
+ "        )\n",
+ "\n",
+ "    return model.decode(output[0])\n",
+ "\n",
+ "demo = gr.Interface(\n",
+ "    fn=process_audio,\n",
+ "    inputs=[\n",
+ "        gr.Audio(sources=[\"microphone\", \"upload\"], type=\"numpy\", label=\"Audio\"),\n",
+ "        gr.Textbox(value=\"You are a helpful voice assistant.\", label=\"System Prompt\"),\n",
+ "        gr.Textbox(value=\"What is being said? <|start_of_audio|><|end_of_audio|>\", label=\"User Prompt\"),\n",
+ "        gr.Slider(64, 512, value=256, step=64, label=\"Max Tokens\"),\n",
+ "        gr.Slider(0.1, 1.5, value=0.7, step=0.1, label=\"Temperature\"),\n",
+ "    ],\n",
+ "    outputs=gr.Textbox(label=\"Response\", lines=10),\n",
+ "    title=\"🌌 Borealis Audio Chat\",\n",
+ "    description=\"Record or upload audio and chat with Borealis!\",\n",
+ ")\n",
+ "\n",
+ "demo.launch(inline=True, height=600)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## 📊 Training Data\n",
+ "\n",
+ "Borealis was fine-tuned on:\n",
+ "\n",
+ "| Dataset | Description | Link |\n",
+ "|---------|-------------|------|\n",
+ "| Speech-Instructions | General speech instruction-following | [🔗](https://huggingface.co/datasets/Vikhrmodels/Speech-Instructions) |\n",
+ "| Speech-Describe | Audio description tasks | [🔗](https://huggingface.co/datasets/Vikhrmodels/Speech-Describe) |\n",
+ "| ToneBooks | Russian audiobook excerpts | [🔗](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) |\n",
+ "| AudioBooksInstructGemini2.5 | Gemini-generated instructions | [🔗](https://huggingface.co/datasets/Vikhrmodels/AudioBooksInstructGemini2.5) |\n",
+ "\n",
+ "---\n",
+ "\n",
+ "## 📎 Links\n",
+ "\n",
+ "- **Model**: [Vikhrmodels/Borealis-5b-it](https://huggingface.co/Vikhrmodels/Borealis-5b-it)\n",
+ "- **Demo Space**: [Vikhrmodels/Borealis-inference](https://huggingface.co/spaces/Vikhrmodels/Borealis-inference)\n",
+ "- **GitHub**: [VikhrModels/Borealis](https://github.com/VikhrModels/Borealis)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }