remiai3 committed · verified
Commit e241d8a · 1 Parent(s): d905308

Upload 8 files
co-lab/README.md ADDED
@@ -0,0 +1,78 @@
+ # 🧑‍🎓 Free AI LLMs for Students – Colab Ready 🚀
+
+ Welcome to **RemiAI3’s Student AI Resources** 🎓
+ This repository provides **ready-to-use Colab notebooks** for open-source Large Language Models (LLMs).
+ Our goal is to make AI research **free, simple, and accessible** for students — no paid GPUs, no hidden costs, just open-source power.
+
+ ---
+
+ ## 📌 About this Repository
+ - ✅ 100% free – runs on Google Colab’s free GPUs (T4 / A100).
+ - ✅ No paid subscriptions required.
+ - ✅ Hugging Face hosted models – secure & reliable.
+ - ✅ Focus on **small (<10 GB)** models so students can run them easily.
+ - ✅ Preconfigured **GGUF quantized models** for faster inference.
+
+ ---
+
+ ## 🔥 Models Included
+ We provide notebooks for some of the **best open-source LLMs** under 10 GB (quantized):
+
+ | Model | Purpose | Notes |
+ |-------|---------|-------|
+ | **Mistral 7B Instruct** | General reasoning & chat | Strong all-rounder |
+ | **Zephyr 7B Beta** | Human-like conversation | Polite, chatty, like ChatGPT |
+ | **OpenHermes 2.5 (Mistral)** | Instruction-tuned Q&A | Good balance of tasks |
+ | **CodeLlama 7B Instruct** | Coding & technical help | Python, JS, C++, etc. |
+ | **DeepSeek Coder 6.7B** | Coding + chat | Lightweight coder |
+ | **TinyLlama 1.1B Chat** | Ultra-lightweight model | Runs even on CPU |
+ | **Llama-2 13B Chat** | Stronger reasoning | Needs Colab A100 for a smooth run |
+
+ ---
+
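For reference, the repo/file pairs that the notebooks in this upload download can be collected in one lookup table. This is a hypothetical helper, not a file in the repo; it only lists the four models whose notebooks appear in this commit, using exactly the repo IDs and GGUF filenames those notebooks hard-code:

```python
# Hypothetical registry: maps a model from the table above to the
# Hugging Face repo and GGUF file its notebook downloads.
MODELS = {
    "Mistral 7B Instruct": ("TheBloke/Mistral-7B-Instruct-v0.2-GGUF", "mistral-7b-instruct-v0.2.Q4_K_M.gguf"),
    "CodeLlama 7B Instruct": ("TheBloke/CodeLlama-7B-Instruct-GGUF", "codellama-7b-instruct.Q4_K_M.gguf"),
    "DeepSeek Coder 6.7B": ("TheBloke/DeepSeek-Coder-6.7B-instruct-GGUF", "deepseek-coder-6.7b-instruct.Q4_K_M.gguf"),
    "Llama-2 13B Chat": ("TheBloke/Llama-2-13B-Chat-GGUF", "llama-2-13b-chat.Q4_K_M.gguf"),
}

def download_args(name: str) -> dict:
    """Return keyword arguments for huggingface_hub.hf_hub_download."""
    repo_id, filename = MODELS[name]
    return {"repo_id": repo_id, "filename": filename}
```

The same pairs appear in each notebook's "Choose model & file" cell; collecting them here just makes the table above machine-readable.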
+ ## 🚀 How to Use in Colab
+
+ 1. **Open a notebook** (see links below).
+ 2. In Colab, go to **Runtime > Change runtime type > GPU (T4)**.
+    - (If available, choose **A100** for larger models like Llama-2 13B.)
+ 3. Run the cells **top to bottom**.
+ 4. If asked for a Hugging Face token (for gated models), [create a free token](https://huggingface.co/settings/tokens) and paste it into the login cell.
+ 5. Start chatting in the interactive UI provided in the notebook.
+
+ ---
+
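Step 2 matters because the notebooks ask llama-cpp-python to offload every layer to the GPU (`n_gpu_layers=-1`). As a minimal sketch of how you could make that choice explicit (a hypothetical helper, not code from the notebooks, which always pass -1):

```python
import shutil

def pick_n_gpu_layers() -> int:
    # -1 tells llama-cpp-python to offload all layers to the GPU;
    # 0 keeps inference entirely on the CPU.
    # We key off whether the nvidia-smi binary is on PATH.
    return -1 if shutil.which("nvidia-smi") else 0
```

On a CPU-only runtime the model still loads, just much more slowly, which is why TinyLlama is the recommended fallback.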
+ ## 📒 Colab Notebooks
+
+ - [Mistral 7B Instruct](./mistral_7b_instruct_gguf,_q4_k_m.ipynb)
+ - [Zephyr 7B Beta](./zephyr_7b_beta_gguf,_q4_k_m.ipynb)
+ - [OpenHermes 2.5](./openhermes_2.5___mistral_7b_gguf,_q4_k_m.ipynb)
+ - [CodeLlama 7B Instruct](./codellama_7b_instruct_gguf,_q4_k_m.ipynb)
+ - [DeepSeek Coder 6.7B](./deepseek_coder_6.7b_instruct_gguf,_q4_k_m.ipynb)
+ - [TinyLlama 1.1B Chat](./tinyllama_1.1b_chat_gguf,_q4_k_m.ipynb)
+ - [Llama-2 13B Chat](./llama_2_13b_chat_gguf,_q4_k_m.ipynb)
+
+ ---
+
+ ## ✨ Why This Repo?
+ Many students don’t have access to high-end GPUs or paid services.
+ This repo ensures **free access to AI** for research and learning.
+ - Models are **optimized for Colab’s free GPUs**.
+ - Easy **one-click notebooks** to start experiments.
+ - No hidden costs, no lock-ins.
+
+ ---
+
+ ## ⚠️ Limitations
+ - Free Colab sessions reset after ~12 hours – save your work!
+ - Some models (like Llama-2 13B) run much better on an A100 than a T4.
+ - An internet connection is required to download the models the first time.
+
+ ---
+
+ ## ❤️ Contributing
+ This is a community-first project.
+ If you want to add more Colab notebooks for AI models (OCR, T2V, Q&A, etc.), feel free to contribute!
+
+ ---
+
+ 📌 **Maintainer:** [RemiAI3 on Hugging Face](https://huggingface.co/remiai3)
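Every notebook drives its model with the same system/user prompt pair exposed as form parameters. As a minimal sketch of the chat payload they build (a hypothetical helper for illustration):

```python
def build_messages(system_prompt: str, user_prompt: str) -> list:
    # The two roles every notebook's chat cells expose as parameters.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```

A list in this shape can be passed straight to llama-cpp-python's chat API, which applies the model's own prompt template.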
co-lab/codellama_7b_instruct_gguf,_q4_k_m.ipynb ADDED
@@ -0,0 +1,217 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "id": "0d8941da",
+    "metadata": {},
+    "source": [
+     "# CodeLlama 7B Instruct (GGUF, Q4_K_M)\n",
+     "**One-click Colab notebook** to run `TheBloke/CodeLlama-7B-Instruct-GGUF` (file: `codellama-7b-instruct.Q4_K_M.gguf`) using **llama-cpp-python** on GPU (T4/A100) or CPU.\n",
+     "\n",
+     "**Features**: Hugging Face login, GGUF download, fast GPU inference, chat UI cell, optional API server.\n",
+     "\n",
+     "Best for coding tasks. Try temperature=0.2–0.4 for deterministic code.\n",
+     "\n",
+     "---\n",
+     "**Tip:** If you're on free Colab, switch to a GPU runtime: **Runtime → Change runtime type → T4 GPU**.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "69426b1a",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔧 Check GPU and Python version\n",
+     "!nvidia-smi || echo \"No NVIDIA GPU available\"\n",
+     "!python --version"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "fdf8f51c",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⬇️ Install dependencies (GPU build if available)\n",
+     "# If you get build errors, re-run this cell.\n",
+     "import sys, subprocess\n",
+     "\n",
+     "cuda_spec = \"cu121\"\n",
+     "wheels_index = \"https://abetlen.github.io/llama-cpp-python/whl/\" + cuda_spec\n",
+     "try:\n",
+     "    # Try the prebuilt CUDA wheel first\n",
+     "    exitcode = subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                                f\"--extra-index-url={wheels_index}\", \"llama-cpp-python>=0.2.90\",\n",
+     "                                \"huggingface_hub>=0.23.0\", \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "    if exitcode != 0:\n",
+     "        raise RuntimeError(\"GPU wheel failed, falling back to CPU wheel\")\n",
+     "except Exception as e:\n",
+     "    print(\"Falling back to CPU wheel:\", e)\n",
+     "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                           \"llama-cpp-python>=0.2.90\", \"huggingface_hub>=0.23.0\",\n",
+     "                           \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "\n",
+     "print(\"✅ Installation complete\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "32f6095b",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔐 (Optional) Hugging Face login\n",
+     "# Enter your HF token if the repo is gated (skip if public)\n",
+     "HF_TOKEN = \"\" #@param {type:\"string\"}\n",
+     "from huggingface_hub import login\n",
+     "if HF_TOKEN:\n",
+     "    login(HF_TOKEN, add_to_git_credential=True)\n",
+     "else:\n",
+     "    print(\"Skipping login (public repos should work without a token)\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "892b623e",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 📦 Choose model & file (GGUF)\n",
+     "from huggingface_hub import hf_hub_download\n",
+     "\n",
+     "REPO_ID = \"TheBloke/CodeLlama-7B-Instruct-GGUF\" #@param [\"TheBloke/CodeLlama-7B-Instruct-GGUF\"] {allow-input: true}\n",
+     "FILENAME = \"codellama-7b-instruct.Q4_K_M.gguf\" #@param [\"codellama-7b-instruct.Q4_K_M.gguf\"] {allow-input: true}\n",
+     "\n",
+     "print(\"Downloading:\", REPO_ID, FILENAME)\n",
+     "model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=\"models\")\n",
+     "print(\"Saved to:\", model_path)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "43921b2d",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⚙️ Load model (GPU if available)\n",
+     "from llama_cpp import Llama\n",
+     "\n",
+     "# Context window size; 4096 is a safe default for this model\n",
+     "n_ctx = 4096\n",
+     "llm = Llama(\n",
+     "    model_path=model_path,\n",
+     "    n_gpu_layers=-1,  # offload all layers to the GPU if one is available\n",
+     "    n_ctx=n_ctx,\n",
+     "    logits_all=False,\n",
+     "    verbose=False,\n",
+     ")\n",
+     "print(\"✅ Model loaded\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "c88b309c",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🗣️ Chat (single turn)\n",
+     "system_prompt = \"You are a helpful, polite assistant.\"\n",
+     "user_prompt = \"Explain transformers in simple terms.\" #@param {type:\"string\"}\n",
+     "max_tokens = 512 #@param {type:\"slider\", min:32, max:2048, step:32}\n",
+     "temperature = 0.7 #@param {type:\"number\"}\n",
+     "\n",
+     "# create_chat_completion applies the chat template embedded in the GGUF,\n",
+     "# so the model sees its own correct prompt format.\n",
+     "out = llm.create_chat_completion(\n",
+     "    messages=[\n",
+     "        {\"role\": \"system\", \"content\": system_prompt},\n",
+     "        {\"role\": \"user\", \"content\": user_prompt},\n",
+     "    ],\n",
+     "    max_tokens=max_tokens,\n",
+     "    temperature=temperature,\n",
+     ")\n",
+     "print(out[\"choices\"][0][\"message\"][\"content\"].strip())"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "b1008d49",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔁 Chat Loop (enter queries in the textbox)\n",
+     "import ipywidgets as widgets\n",
+     "from IPython.display import display, Markdown\n",
+     "\n",
+     "system_prompt = widgets.Textarea(\n",
+     "    value=\"You are a helpful, polite assistant.\",\n",
+     "    description='System:',\n",
+     "    layout=widgets.Layout(width='100%', height='80px')\n",
+     ")\n",
+     "\n",
+     "user_box = widgets.Textarea(\n",
+     "    value=\"Write a Python function to check prime numbers.\",\n",
+     "    description='User:',\n",
+     "    layout=widgets.Layout(width='100%', height='100px')\n",
+     ")\n",
+     "\n",
+     "temperature = widgets.FloatSlider(value=0.7, min=0.0, max=1.5, step=0.05, description='Temp')\n",
+     "max_tokens = widgets.IntSlider(value=512, min=32, max=2048, step=32, description='Max tokens')\n",
+     "\n",
+     "run_button = widgets.Button(description=\"Generate\", button_style='success')\n",
+     "\n",
+     "out_area = widgets.Output()\n",
+     "\n",
+     "def on_click(_):\n",
+     "    with out_area:\n",
+     "        out_area.clear_output()\n",
+     "        result = llm.create_chat_completion(\n",
+     "            messages=[\n",
+     "                {\"role\": \"system\", \"content\": system_prompt.value},\n",
+     "                {\"role\": \"user\", \"content\": user_box.value},\n",
+     "            ],\n",
+     "            max_tokens=max_tokens.value,\n",
+     "            temperature=temperature.value,\n",
+     "        )\n",
+     "        display(Markdown(result[\"choices\"][0][\"message\"][\"content\"].strip()))\n",
+     "\n",
+     "run_button.on_click(on_click)\n",
+     "\n",
+     "ui = widgets.VBox([system_prompt, user_box, temperature, max_tokens, run_button, out_area])\n",
+     "display(ui)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "a3739ff5",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🌐 Optional: Start a local API server (OpenAI-compatible)\n",
+     "# After running, open http://127.0.0.1:8000/docs inside Colab.\n",
+     "# Note: create_app loads its own copy of the model from model_path,\n",
+     "# so this cell needs enough free RAM/VRAM for a second instance.\n",
+     "import threading\n",
+     "import uvicorn\n",
+     "from fastapi.middleware.cors import CORSMiddleware\n",
+     "from llama_cpp.server.app import create_app\n",
+     "from llama_cpp.server.settings import ModelSettings, ServerSettings\n",
+     "\n",
+     "app = create_app(\n",
+     "    server_settings=ServerSettings(),\n",
+     "    model_settings=[ModelSettings(model=model_path, n_gpu_layers=-1, n_ctx=n_ctx)],\n",
+     ")\n",
+     "app.add_middleware(\n",
+     "    CORSMiddleware,\n",
+     "    allow_origins=[\"*\"],\n",
+     "    allow_credentials=True,\n",
+     "    allow_methods=[\"*\"],\n",
+     "    allow_headers=[\"*\"],\n",
+     ")\n",
+     "\n",
+     "def run_server():\n",
+     "    uvicorn.run(app, host=\"0.0.0.0\", port=8000, log_level=\"info\")\n",
+     "\n",
+     "thread = threading.Thread(target=run_server, daemon=True)\n",
+     "thread.start()\n",
+     "print(\"Server starting on http://127.0.0.1:8000\")"
+    ]
+   }
+  ],
+  "metadata": {},
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
co-lab/deepseek_coder_6.7b_instruct_gguf,_q4_k_m.ipynb ADDED
@@ -0,0 +1,217 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "id": "26316acd",
+    "metadata": {},
+    "source": [
+     "# DeepSeek Coder 6.7B Instruct (GGUF, Q4_K_M)\n",
+     "**One-click Colab notebook** to run `TheBloke/DeepSeek-Coder-6.7B-instruct-GGUF` (file: `deepseek-coder-6.7b-instruct.Q4_K_M.gguf`) using **llama-cpp-python** on GPU (T4/A100) or CPU.\n",
+     "\n",
+     "**Features**: Hugging Face login, GGUF download, fast GPU inference, chat UI cell, optional API server.\n",
+     "\n",
+     "Great for mixed code + chat workloads.\n",
+     "\n",
+     "---\n",
+     "**Tip:** If you're on free Colab, switch to a GPU runtime: **Runtime → Change runtime type → T4 GPU**.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "4da5a1d1",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔧 Check GPU and Python version\n",
+     "!nvidia-smi || echo \"No NVIDIA GPU available\"\n",
+     "!python --version"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "d0d4b92a",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⬇️ Install dependencies (GPU build if available)\n",
+     "# If you get build errors, re-run this cell.\n",
+     "import sys, subprocess\n",
+     "\n",
+     "cuda_spec = \"cu121\"\n",
+     "wheels_index = \"https://abetlen.github.io/llama-cpp-python/whl/\" + cuda_spec\n",
+     "try:\n",
+     "    # Try the prebuilt CUDA wheel first\n",
+     "    exitcode = subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                                f\"--extra-index-url={wheels_index}\", \"llama-cpp-python>=0.2.90\",\n",
+     "                                \"huggingface_hub>=0.23.0\", \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "    if exitcode != 0:\n",
+     "        raise RuntimeError(\"GPU wheel failed, falling back to CPU wheel\")\n",
+     "except Exception as e:\n",
+     "    print(\"Falling back to CPU wheel:\", e)\n",
+     "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                           \"llama-cpp-python>=0.2.90\", \"huggingface_hub>=0.23.0\",\n",
+     "                           \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "\n",
+     "print(\"✅ Installation complete\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "1be18d48",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔐 (Optional) Hugging Face login\n",
+     "# Enter your HF token if the repo is gated (skip if public)\n",
+     "HF_TOKEN = \"\" #@param {type:\"string\"}\n",
+     "from huggingface_hub import login\n",
+     "if HF_TOKEN:\n",
+     "    login(HF_TOKEN, add_to_git_credential=True)\n",
+     "else:\n",
+     "    print(\"Skipping login (public repos should work without a token)\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "df92e5c0",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 📦 Choose model & file (GGUF)\n",
+     "from huggingface_hub import hf_hub_download\n",
+     "\n",
+     "REPO_ID = \"TheBloke/DeepSeek-Coder-6.7B-instruct-GGUF\" #@param [\"TheBloke/DeepSeek-Coder-6.7B-instruct-GGUF\"] {allow-input: true}\n",
+     "FILENAME = \"deepseek-coder-6.7b-instruct.Q4_K_M.gguf\" #@param [\"deepseek-coder-6.7b-instruct.Q4_K_M.gguf\"] {allow-input: true}\n",
+     "\n",
+     "print(\"Downloading:\", REPO_ID, FILENAME)\n",
+     "model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=\"models\")\n",
+     "print(\"Saved to:\", model_path)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "5a9de05b",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⚙️ Load model (GPU if available)\n",
+     "from llama_cpp import Llama\n",
+     "\n",
+     "# Context window size; 4096 is a safe default for this model\n",
+     "n_ctx = 4096\n",
+     "llm = Llama(\n",
+     "    model_path=model_path,\n",
+     "    n_gpu_layers=-1,  # offload all layers to the GPU if one is available\n",
+     "    n_ctx=n_ctx,\n",
+     "    logits_all=False,\n",
+     "    verbose=False,\n",
+     ")\n",
+     "print(\"✅ Model loaded\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "ce589656",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🗣️ Chat (single turn)\n",
+     "system_prompt = \"You are a helpful, polite assistant.\"\n",
+     "user_prompt = \"Explain transformers in simple terms.\" #@param {type:\"string\"}\n",
+     "max_tokens = 512 #@param {type:\"slider\", min:32, max:2048, step:32}\n",
+     "temperature = 0.7 #@param {type:\"number\"}\n",
+     "\n",
+     "# create_chat_completion applies the chat template embedded in the GGUF,\n",
+     "# so the model sees its own correct prompt format.\n",
+     "out = llm.create_chat_completion(\n",
+     "    messages=[\n",
+     "        {\"role\": \"system\", \"content\": system_prompt},\n",
+     "        {\"role\": \"user\", \"content\": user_prompt},\n",
+     "    ],\n",
+     "    max_tokens=max_tokens,\n",
+     "    temperature=temperature,\n",
+     ")\n",
+     "print(out[\"choices\"][0][\"message\"][\"content\"].strip())"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "8d7aa145",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔁 Chat Loop (enter queries in the textbox)\n",
+     "import ipywidgets as widgets\n",
+     "from IPython.display import display, Markdown\n",
+     "\n",
+     "system_prompt = widgets.Textarea(\n",
+     "    value=\"You are a helpful, polite assistant.\",\n",
+     "    description='System:',\n",
+     "    layout=widgets.Layout(width='100%', height='80px')\n",
+     ")\n",
+     "\n",
+     "user_box = widgets.Textarea(\n",
+     "    value=\"Write a Python function to check prime numbers.\",\n",
+     "    description='User:',\n",
+     "    layout=widgets.Layout(width='100%', height='100px')\n",
+     ")\n",
+     "\n",
+     "temperature = widgets.FloatSlider(value=0.7, min=0.0, max=1.5, step=0.05, description='Temp')\n",
+     "max_tokens = widgets.IntSlider(value=512, min=32, max=2048, step=32, description='Max tokens')\n",
+     "\n",
+     "run_button = widgets.Button(description=\"Generate\", button_style='success')\n",
+     "\n",
+     "out_area = widgets.Output()\n",
+     "\n",
+     "def on_click(_):\n",
+     "    with out_area:\n",
+     "        out_area.clear_output()\n",
+     "        result = llm.create_chat_completion(\n",
+     "            messages=[\n",
+     "                {\"role\": \"system\", \"content\": system_prompt.value},\n",
+     "                {\"role\": \"user\", \"content\": user_box.value},\n",
+     "            ],\n",
+     "            max_tokens=max_tokens.value,\n",
+     "            temperature=temperature.value,\n",
+     "        )\n",
+     "        display(Markdown(result[\"choices\"][0][\"message\"][\"content\"].strip()))\n",
+     "\n",
+     "run_button.on_click(on_click)\n",
+     "\n",
+     "ui = widgets.VBox([system_prompt, user_box, temperature, max_tokens, run_button, out_area])\n",
+     "display(ui)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "4284a12d",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🌐 Optional: Start a local API server (OpenAI-compatible)\n",
+     "# After running, open http://127.0.0.1:8000/docs inside Colab.\n",
+     "# Note: create_app loads its own copy of the model from model_path,\n",
+     "# so this cell needs enough free RAM/VRAM for a second instance.\n",
+     "import threading\n",
+     "import uvicorn\n",
+     "from fastapi.middleware.cors import CORSMiddleware\n",
+     "from llama_cpp.server.app import create_app\n",
+     "from llama_cpp.server.settings import ModelSettings, ServerSettings\n",
+     "\n",
+     "app = create_app(\n",
+     "    server_settings=ServerSettings(),\n",
+     "    model_settings=[ModelSettings(model=model_path, n_gpu_layers=-1, n_ctx=n_ctx)],\n",
+     ")\n",
+     "app.add_middleware(\n",
+     "    CORSMiddleware,\n",
+     "    allow_origins=[\"*\"],\n",
+     "    allow_credentials=True,\n",
+     "    allow_methods=[\"*\"],\n",
+     "    allow_headers=[\"*\"],\n",
+     ")\n",
+     "\n",
+     "def run_server():\n",
+     "    uvicorn.run(app, host=\"0.0.0.0\", port=8000, log_level=\"info\")\n",
+     "\n",
+     "thread = threading.Thread(target=run_server, daemon=True)\n",
+     "thread.start()\n",
+     "print(\"Server starting on http://127.0.0.1:8000\")"
+    ]
+   }
+  ],
+  "metadata": {},
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
co-lab/llama_2_13b_chat_gguf,_q4_k_m.ipynb ADDED
@@ -0,0 +1,217 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "id": "48ff48fe",
+    "metadata": {},
+    "source": [
+     "# Llama-2 13B Chat (GGUF, Q4_K_M)\n",
+     "**One-click Colab notebook** to run `TheBloke/Llama-2-13B-Chat-GGUF` (file: `llama-2-13b-chat.Q4_K_M.gguf`) using **llama-cpp-python** on GPU (T4/A100) or CPU.\n",
+     "\n",
+     "**Features**: Hugging Face login, GGUF download, fast GPU inference, chat UI cell, optional API server.\n",
+     "\n",
+     "⚠️ Prefer A100 in Colab for this model due to VRAM needs.\n",
+     "\n",
+     "---\n",
+     "**Tip:** If you're on free Colab, switch to a GPU runtime: **Runtime → Change runtime type → T4 GPU**.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "2c594cf4",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔧 Check GPU and Python version\n",
+     "!nvidia-smi || echo \"No NVIDIA GPU available\"\n",
+     "!python --version"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "b651dbee",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⬇️ Install dependencies (GPU build if available)\n",
+     "# If you get build errors, re-run this cell.\n",
+     "import sys, subprocess\n",
+     "\n",
+     "cuda_spec = \"cu121\"\n",
+     "wheels_index = \"https://abetlen.github.io/llama-cpp-python/whl/\" + cuda_spec\n",
+     "try:\n",
+     "    # Try the prebuilt CUDA wheel first\n",
+     "    exitcode = subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                                f\"--extra-index-url={wheels_index}\", \"llama-cpp-python>=0.2.90\",\n",
+     "                                \"huggingface_hub>=0.23.0\", \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "    if exitcode != 0:\n",
+     "        raise RuntimeError(\"GPU wheel failed, falling back to CPU wheel\")\n",
+     "except Exception as e:\n",
+     "    print(\"Falling back to CPU wheel:\", e)\n",
+     "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                           \"llama-cpp-python>=0.2.90\", \"huggingface_hub>=0.23.0\",\n",
+     "                           \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "\n",
+     "print(\"✅ Installation complete\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "b0fca1d7",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔐 (Optional) Hugging Face login\n",
+     "# Enter your HF token if the repo is gated (skip if public)\n",
+     "HF_TOKEN = \"\" #@param {type:\"string\"}\n",
+     "from huggingface_hub import login\n",
+     "if HF_TOKEN:\n",
+     "    login(HF_TOKEN, add_to_git_credential=True)\n",
+     "else:\n",
+     "    print(\"Skipping login (public repos should work without a token)\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "820f574f",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 📦 Choose model & file (GGUF)\n",
+     "from huggingface_hub import hf_hub_download\n",
+     "\n",
+     "REPO_ID = \"TheBloke/Llama-2-13B-Chat-GGUF\" #@param [\"TheBloke/Llama-2-13B-Chat-GGUF\"] {allow-input: true}\n",
+     "FILENAME = \"llama-2-13b-chat.Q4_K_M.gguf\" #@param [\"llama-2-13b-chat.Q4_K_M.gguf\"] {allow-input: true}\n",
+     "\n",
+     "print(\"Downloading:\", REPO_ID, FILENAME)\n",
+     "model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=\"models\")\n",
+     "print(\"Saved to:\", model_path)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "eecb0e65",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⚙️ Load model (GPU if available)\n",
+     "from llama_cpp import Llama\n",
+     "\n",
+     "# Context window size; 4096 is a safe default for this model\n",
+     "n_ctx = 4096\n",
+     "llm = Llama(\n",
+     "    model_path=model_path,\n",
+     "    n_gpu_layers=-1,  # offload all layers to the GPU if one is available\n",
+     "    n_ctx=n_ctx,\n",
+     "    logits_all=False,\n",
+     "    verbose=False,\n",
+     ")\n",
+     "print(\"✅ Model loaded\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "4ee8332b",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🗣️ Chat (single turn)\n",
+     "system_prompt = \"You are a helpful, polite assistant.\"\n",
+     "user_prompt = \"Explain transformers in simple terms.\" #@param {type:\"string\"}\n",
+     "max_tokens = 512 #@param {type:\"slider\", min:32, max:2048, step:32}\n",
+     "temperature = 0.7 #@param {type:\"number\"}\n",
+     "\n",
+     "# create_chat_completion applies the chat template embedded in the GGUF,\n",
+     "# so the model sees its own correct prompt format.\n",
+     "out = llm.create_chat_completion(\n",
+     "    messages=[\n",
+     "        {\"role\": \"system\", \"content\": system_prompt},\n",
+     "        {\"role\": \"user\", \"content\": user_prompt},\n",
+     "    ],\n",
+     "    max_tokens=max_tokens,\n",
+     "    temperature=temperature,\n",
+     ")\n",
+     "print(out[\"choices\"][0][\"message\"][\"content\"].strip())"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "994a9e59",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔁 Chat Loop (enter queries in the textbox)\n",
+     "import ipywidgets as widgets\n",
+     "from IPython.display import display, Markdown\n",
+     "\n",
+     "system_prompt = widgets.Textarea(\n",
+     "    value=\"You are a helpful, polite assistant.\",\n",
+     "    description='System:',\n",
+     "    layout=widgets.Layout(width='100%', height='80px')\n",
+     ")\n",
+     "\n",
+     "user_box = widgets.Textarea(\n",
+     "    value=\"Write a Python function to check prime numbers.\",\n",
+     "    description='User:',\n",
+     "    layout=widgets.Layout(width='100%', height='100px')\n",
+     ")\n",
+     "\n",
+     "temperature = widgets.FloatSlider(value=0.7, min=0.0, max=1.5, step=0.05, description='Temp')\n",
+     "max_tokens = widgets.IntSlider(value=512, min=32, max=2048, step=32, description='Max tokens')\n",
+     "\n",
+     "run_button = widgets.Button(description=\"Generate\", button_style='success')\n",
+     "\n",
+     "out_area = widgets.Output()\n",
+     "\n",
+     "def on_click(_):\n",
+     "    with out_area:\n",
+     "        out_area.clear_output()\n",
+     "        result = llm.create_chat_completion(\n",
+     "            messages=[\n",
+     "                {\"role\": \"system\", \"content\": system_prompt.value},\n",
+     "                {\"role\": \"user\", \"content\": user_box.value},\n",
+     "            ],\n",
+     "            max_tokens=max_tokens.value,\n",
+     "            temperature=temperature.value,\n",
+     "        )\n",
+     "        display(Markdown(result[\"choices\"][0][\"message\"][\"content\"].strip()))\n",
+     "\n",
+     "run_button.on_click(on_click)\n",
+     "\n",
+     "ui = widgets.VBox([system_prompt, user_box, temperature, max_tokens, run_button, out_area])\n",
+     "display(ui)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "249b2265",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🌐 Optional: Start a local API server (OpenAI-compatible)\n",
+     "# After running, open http://127.0.0.1:8000/docs inside Colab.\n",
+     "# Note: create_app loads its own copy of the model from model_path,\n",
+     "# so this cell needs enough free RAM/VRAM for a second instance.\n",
+     "import threading\n",
+     "import uvicorn\n",
+     "from fastapi.middleware.cors import CORSMiddleware\n",
+     "from llama_cpp.server.app import create_app\n",
+     "from llama_cpp.server.settings import ModelSettings, ServerSettings\n",
+     "\n",
+     "app = create_app(\n",
+     "    server_settings=ServerSettings(),\n",
+     "    model_settings=[ModelSettings(model=model_path, n_gpu_layers=-1, n_ctx=n_ctx)],\n",
+     ")\n",
+     "app.add_middleware(\n",
+     "    CORSMiddleware,\n",
+     "    allow_origins=[\"*\"],\n",
+     "    allow_credentials=True,\n",
+     "    allow_methods=[\"*\"],\n",
+     "    allow_headers=[\"*\"],\n",
+     ")\n",
+     "\n",
+     "def run_server():\n",
+     "    uvicorn.run(app, host=\"0.0.0.0\", port=8000, log_level=\"info\")\n",
+     "\n",
+     "thread = threading.Thread(target=run_server, daemon=True)\n",
+     "thread.start()\n",
+     "print(\"Server starting on http://127.0.0.1:8000\")"
+    ]
+   }
+  ],
+  "metadata": {},
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
co-lab/mistral_7b_instruct_gguf,_q4_k_m.ipynb ADDED
@@ -0,0 +1,217 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "id": "ba14f830",
+    "metadata": {},
+    "source": [
+     "# Mistral 7B Instruct (GGUF, Q4_K_M)\n",
+     "**One-click Colab notebook** to run `TheBloke/Mistral-7B-Instruct-v0.2-GGUF` (file: `mistral-7b-instruct-v0.2.Q4_K_M.gguf`) using **llama-cpp-python** on GPU (T4/A100) or CPU.\n",
+     "\n",
+     "**Features**: Hugging Face login, GGUF download, fast GPU inference, chat UI cell, optional API server.\n",
+     "\n",
+     "---\n",
+     "**Tip:** If you're on free Colab, switch to a GPU runtime: **Runtime → Change runtime type → T4 GPU**.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "ad713df8",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔧 Check GPU and Python version\n",
+     "!nvidia-smi || echo \"No NVIDIA GPU available\"\n",
+     "!python --version"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "fe2fb1d9",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⬇️ Install dependencies (GPU build if available)\n",
+     "# If you get build errors, re-run this cell.\n",
+     "import sys, subprocess\n",
+     "\n",
+     "cuda_spec = \"cu121\"\n",
+     "wheels_index = \"https://abetlen.github.io/llama-cpp-python/whl/\" + cuda_spec\n",
+     "try:\n",
+     "    # Try the prebuilt CUDA wheel first\n",
+     "    exitcode = subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                                f\"--extra-index-url={wheels_index}\", \"llama-cpp-python>=0.2.90\",\n",
+     "                                \"huggingface_hub>=0.23.0\", \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "    if exitcode != 0:\n",
+     "        raise RuntimeError(\"GPU wheel failed, falling back to CPU wheel\")\n",
+     "except Exception as e:\n",
+     "    print(\"Falling back to CPU wheel:\", e)\n",
+     "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
+     "                           \"llama-cpp-python>=0.2.90\", \"huggingface_hub>=0.23.0\",\n",
+     "                           \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
+     "\n",
+     "print(\"✅ Installation complete\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "9da4a253",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🔐 (Optional) Hugging Face login\n",
+     "# Enter your HF token if the repo is gated (skip if public)\n",
+     "HF_TOKEN = \"\" #@param {type:\"string\"}\n",
+     "from huggingface_hub import login\n",
+     "if HF_TOKEN:\n",
+     "    login(HF_TOKEN, add_to_git_credential=True)\n",
+     "else:\n",
+     "    print(\"Skipping login (public repos should work without a token)\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "36052075",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 📦 Choose model & file (GGUF)\n",
+     "from huggingface_hub import hf_hub_download\n",
+     "\n",
+     "REPO_ID = \"TheBloke/Mistral-7B-Instruct-v0.2-GGUF\" #@param [\"TheBloke/Mistral-7B-Instruct-v0.2-GGUF\"] {allow-input: true}\n",
+     "FILENAME = \"mistral-7b-instruct-v0.2.Q4_K_M.gguf\" #@param [\"mistral-7b-instruct-v0.2.Q4_K_M.gguf\"] {allow-input: true}\n",
+     "\n",
+     "print(\"Downloading:\", REPO_ID, FILENAME)\n",
+     "model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=\"models\")\n",
+     "print(\"Saved to:\", model_path)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "3bd19d64",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title ⚙️ Load model (GPU if available)\n",
+     "from llama_cpp import Llama\n",
+     "\n",
+     "# Context window size; 4096 is a safe default for this model\n",
+     "n_ctx = 4096\n",
+     "llm = Llama(\n",
+     "    model_path=model_path,\n",
+     "    n_gpu_layers=-1,  # offload all layers to the GPU if one is available\n",
+     "    n_ctx=n_ctx,\n",
+     "    logits_all=False,\n",
+     "    verbose=False,\n",
+     ")\n",
+     "print(\"✅ Model loaded\")"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "092b07d6",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "#@title 🗣️ Chat (single turn)\n",
+     "system_prompt = \"You are a helpful, polite assistant.\"\n",
+     "user_prompt = \"Explain transformers in simple terms.\" #@param {type:\"string\"}\n",
+     "max_tokens = 512 #@param {type:\"slider\", min:32, max:2048, step:32}\n",
127
+ "temperature = 0.7 #@param {type:\"number\"}\n",
128
+ "\n",
129
+ "prompt = f\"<|system|>\\n{system_prompt}\\n<|user|>\\n{user_prompt}\\n<|assistant|>\\n\"\n",
130
+ "out = llm(\n",
131
+ " prompt,\n",
132
+ " max_tokens=max_tokens,\n",
133
+ " temperature=temperature,\n",
134
+ " stop=[\"<|user|>\", \"<|system|>\", \"</s>\"]\n",
135
+ ")\n",
136
+ "print(out[\"choices\"][0][\"text\"].strip())"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": null,
142
+ "id": "62a9469a",
143
+ "metadata": {},
144
+ "outputs": [],
145
+ "source": [
146
+ "#@title 🔁 Chat Loop (enter queries in the textbox)\n",
147
+ "import ipywidgets as widgets\n",
148
+ "from IPython.display import display, Markdown\n",
149
+ "\n",
150
+ "system_prompt = widgets.Textarea(\n",
151
+ " value=\"You are a helpful, polite assistant.\",\n",
152
+ " description='System:',\n",
153
+ " layout=widgets.Layout(width='100%', height='80px')\n",
154
+ ")\n",
155
+ "\n",
156
+ "user_box = widgets.Textarea(\n",
157
+ " value=\"Write a Python function to check prime numbers.\",\n",
158
+ " description='User:',\n",
159
+ " layout=widgets.Layout(width='100%', height='100px')\n",
160
+ ")\n",
161
+ "\n",
162
+ "temperature = widgets.FloatSlider(value=0.7, min=0.0, max=1.5, step=0.05, description='Temp')\n",
163
+ "max_tokens = widgets.IntSlider(value=512, min=32, max=2048, step=32, description='Max tokens')\n",
164
+ "\n",
165
+ "run_button = widgets.Button(description=\"Generate\", button_style='success')\n",
166
+ "\n",
167
+ "out_area = widgets.Output()\n",
168
+ "\n",
169
+ "def on_click(_):\n",
170
+ " with out_area:\n",
171
+ " out_area.clear_output()\n",
172
+ " prompt = f\"<|system|>\\n{system_prompt.value}\\n<|user|>\\n{user_box.value}\\n<|assistant|>\\n\"\n",
173
+ " result = llm(prompt, max_tokens=max_tokens.value, temperature=temperature.value, stop=[\"<|user|>\", \"<|system|>\", \"</s>\"])\n",
174
+ " display(Markdown(result[\"choices\"][0][\"text\"].strip()))\n",
175
+ "\n",
176
+ "run_button.on_click(on_click)\n",
177
+ "\n",
178
+ "ui = widgets.VBox([system_prompt, user_box, temperature, max_tokens, run_button, out_area])\n",
179
+ "display(ui)"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": null,
185
+ "id": "9b148ffc",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "#@title 🌐 Optional: Start a local API server (OpenAI-compatible-ish)\n",
190
+ "# After running, you can access http://127.0.0.1:8000/docs inside Colab.\n",
191
+ "import threading\n",
192
+ "from llama_cpp.server.app import create_app\n",
193
+ "from fastapi.middleware.cors import CORSMiddleware\n",
194
+ "import uvicorn\n",
195
+ "\n",
196
+ "app = create_app(llm)\n",
197
+ "app.add_middleware(\n",
198
+ " CORSMiddleware,\n",
199
+ " allow_origins=[\"*\"],\n",
200
+ " allow_credentials=True,\n",
201
+ " allow_methods=[\"*\"],\n",
202
+ " allow_headers=[\"*\"],\n",
203
+ ")\n",
204
+ "\n",
205
+ "def run_server():\n",
206
+ " uvicorn.run(app, host=\"0.0.0.0\", port=8000, log_level=\"info\")\n",
207
+ "\n",
208
+ "thread = threading.Thread(target=run_server, daemon=True)\n",
209
+ "thread.start()\n",
210
+ "print(\"Server starting on http://127.0.0.1:8000\")"
211
+ ]
212
+ }
213
+ ],
214
+ "metadata": {},
215
+ "nbformat": 4,
216
+ "nbformat_minor": 5
217
+ }
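All of the notebooks in this commit assemble their single-turn prompt with the same inline f-string. Pulled out as a standalone helper, it looks like the sketch below (`build_prompt` is a hypothetical name, not defined in the notebooks). Note that the `<|system|>` / `<|user|>` / `<|assistant|>` tags are Zephyr-style; for Mistral-Instruct, OpenHermes, and TinyLlama this is a generic approximation, not each model's native chat template.

```python
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Assemble a single-turn chat prompt in the notebooks' tag format.

    The matching stop strings used at generation time are
    ["<|user|>", "<|system|>", "</s>"].
    """
    return (
        f"<|system|>\n{system_prompt}\n"
        f"<|user|>\n{user_prompt}\n"
        f"<|assistant|>\n"
    )
```

For stricter per-model formatting, `llm.create_chat_completion(messages=[...])` lets llama-cpp-python apply the chat template embedded in the GGUF file instead.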
co-lab/openhermes_2.5___mistral_7b_gguf,_q4_k_m.ipynb ADDED
@@ -0,0 +1,217 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9c2af62c",
"metadata": {},
"source": [
"# OpenHermes 2.5 - Mistral 7B (GGUF, Q4_K_M)\n",
"**One-click Colab notebook** to run `TheBloke/OpenHermes-2.5-Mistral-7B-GGUF` (file: `openhermes-2.5-mistral-7b.Q4_K_M.gguf`) using **llama-cpp-python** on GPU (T4/A100) or CPU.\n",
"\n",
"**Features**: Hugging Face login, GGUF download, fast GPU inference, chat UI cell, optional API server.\n",
"\n",
"\n",
"---\n",
"**Tip:** If you're on free Colab, switch to a GPU runtime: **Runtime → Change runtime type → T4 GPU**.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d37c033",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔧 Check GPU and Python version\n",
"!nvidia-smi || echo \"No NVIDIA GPU available\"\n",
"!python --version"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddd17471",
"metadata": {},
"outputs": [],
"source": [
"#@title ⬇️ Install dependencies (GPU build if available)\n",
"# If you get build errors, re-run this cell.\n",
"import os, sys, subprocess\n",
"\n",
"cuda_spec = \"cu121\"\n",
"wheels_index = \"https://abetlen.github.io/llama-cpp-python/whl/\" + cuda_spec\n",
"try:\n",
"    # Try GPU wheel first\n",
"    exitcode = subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
"        f\"--extra-index-url={wheels_index}\", \"llama-cpp-python>=0.2.90\",\n",
"        \"huggingface_hub>=0.23.0\", \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
"    if exitcode != 0:\n",
"        raise RuntimeError(\"GPU wheel failed, falling back to CPU wheel\")\n",
"except Exception as e:\n",
"    print(\"Falling back to CPU wheel:\", e)\n",
"    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
"        \"llama-cpp-python>=0.2.90\", \"huggingface_hub>=0.23.0\",\n",
"        \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
"\n",
"print(\"✅ Installation complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0dde098",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔐 (Optional) Hugging Face login\n",
"# Enter your HF token if the repo is gated (skip if public)\n",
"HF_TOKEN = \"\" #@param {type:\"string\"}\n",
"from huggingface_hub import login\n",
"if HF_TOKEN:\n",
"    login(HF_TOKEN, add_to_git_credential=True)\n",
"else:\n",
"    print(\"Skipping login (public repos should work without a token)\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d913e9ce",
"metadata": {},
"outputs": [],
"source": [
"#@title 📦 Choose model & file (GGUF)\n",
"from huggingface_hub import hf_hub_download\n",
"\n",
"REPO_ID = \"TheBloke/OpenHermes-2.5-Mistral-7B-GGUF\" #@param [\"TheBloke/OpenHermes-2.5-Mistral-7B-GGUF\"] {allow-input: true}\n",
"FILENAME = \"openhermes-2.5-mistral-7b.Q4_K_M.gguf\" #@param [\"openhermes-2.5-mistral-7b.Q4_K_M.gguf\"] {allow-input: true}\n",
"\n",
"print(\"Downloading:\", REPO_ID, FILENAME)\n",
"model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=\"models\", local_dir_use_symlinks=False)\n",
"print(\"Saved to:\", model_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f38e9750",
"metadata": {},
"outputs": [],
"source": [
"#@title ⚙️ Load model (GPU if available)\n",
"from llama_cpp import Llama\n",
"\n",
"# Context window in tokens; 4096 is a safe default for this model\n",
"n_ctx = 4096\n",
"llm = Llama(\n",
"    model_path=model_path,\n",
"    n_gpu_layers=-1,  # Offload all layers to GPU if available, otherwise CPU\n",
"    n_ctx=n_ctx,\n",
"    logits_all=False,\n",
"    verbose=False,\n",
")\n",
"print(\"✅ Model loaded\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2910b65",
"metadata": {},
"outputs": [],
"source": [
"#@title 🗣️ Chat (single turn)\n",
"system_prompt = \"You are a helpful, polite assistant.\"\n",
"user_prompt = \"Explain transformers in simple terms.\" #@param {type:\"string\"}\n",
"max_tokens = 512 #@param {type:\"slider\", min:32, max:2048, step:32}\n",
"temperature = 0.7 #@param {type:\"number\"}\n",
"\n",
"prompt = f\"<|system|>\\n{system_prompt}\\n<|user|>\\n{user_prompt}\\n<|assistant|>\\n\"\n",
"out = llm(\n",
"    prompt,\n",
"    max_tokens=max_tokens,\n",
"    temperature=temperature,\n",
"    stop=[\"<|user|>\", \"<|system|>\", \"</s>\"]\n",
")\n",
"print(out[\"choices\"][0][\"text\"].strip())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95361706",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔁 Chat Loop (enter queries in the textbox)\n",
"import ipywidgets as widgets\n",
"from IPython.display import display, Markdown\n",
"\n",
"system_prompt = widgets.Textarea(\n",
"    value=\"You are a helpful, polite assistant.\",\n",
"    description='System:',\n",
"    layout=widgets.Layout(width='100%', height='80px')\n",
")\n",
"\n",
"user_box = widgets.Textarea(\n",
"    value=\"Write a Python function to check prime numbers.\",\n",
"    description='User:',\n",
"    layout=widgets.Layout(width='100%', height='100px')\n",
")\n",
"\n",
"temperature = widgets.FloatSlider(value=0.7, min=0.0, max=1.5, step=0.05, description='Temp')\n",
"max_tokens = widgets.IntSlider(value=512, min=32, max=2048, step=32, description='Max tokens')\n",
"\n",
"run_button = widgets.Button(description=\"Generate\", button_style='success')\n",
"\n",
"out_area = widgets.Output()\n",
"\n",
"def on_click(_):\n",
"    with out_area:\n",
"        out_area.clear_output()\n",
"        prompt = f\"<|system|>\\n{system_prompt.value}\\n<|user|>\\n{user_box.value}\\n<|assistant|>\\n\"\n",
"        result = llm(prompt, max_tokens=max_tokens.value, temperature=temperature.value, stop=[\"<|user|>\", \"<|system|>\", \"</s>\"])\n",
"        display(Markdown(result[\"choices\"][0][\"text\"].strip()))\n",
"\n",
"run_button.on_click(on_click)\n",
"\n",
"ui = widgets.VBox([system_prompt, user_box, temperature, max_tokens, run_button, out_area])\n",
"display(ui)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "187f6a82",
"metadata": {},
"outputs": [],
"source": [
"#@title 🌐 Optional: Start a local API server (OpenAI-compatible-ish)\n",
"# After running, you can access http://127.0.0.1:8000/docs inside Colab.\n",
"import threading\n",
"from llama_cpp.server.app import create_app\n",
"from fastapi.middleware.cors import CORSMiddleware\n",
"import uvicorn\n",
"\n",
"app = create_app(llm)\n",
"app.add_middleware(\n",
"    CORSMiddleware,\n",
"    allow_origins=[\"*\"],\n",
"    allow_credentials=True,\n",
"    allow_methods=[\"*\"],\n",
"    allow_headers=[\"*\"],\n",
")\n",
"\n",
"def run_server():\n",
"    uvicorn.run(app, host=\"0.0.0.0\", port=8000, log_level=\"info\")\n",
"\n",
"thread = threading.Thread(target=run_server, daemon=True)\n",
"thread.start()\n",
"print(\"Server starting on http://127.0.0.1:8000\")"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
co-lab/tinyllama_1.1b_chat_gguf,_q4_k_m.ipynb ADDED
@@ -0,0 +1,217 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cf52d213",
"metadata": {},
"source": [
"# TinyLlama 1.1B Chat (GGUF, Q4_K_M)\n",
"**One-click Colab notebook** to run `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` (file: `tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf`) using **llama-cpp-python** on GPU (T4/A100) or CPU.\n",
"\n",
"**Features**: Hugging Face login, GGUF download, fast GPU inference, chat UI cell, optional API server.\n",
"\n",
"Very lightweight; good for CPU-only.\n",
"---\n",
"**Tip:** If you're on free Colab, switch to a GPU runtime: **Runtime → Change runtime type → T4 GPU**.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "579dab88",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔧 Check GPU and Python version\n",
"!nvidia-smi || echo \"No NVIDIA GPU available\"\n",
"!python --version"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a63cc8a9",
"metadata": {},
"outputs": [],
"source": [
"#@title ⬇️ Install dependencies (GPU build if available)\n",
"# If you get build errors, re-run this cell.\n",
"import os, sys, subprocess\n",
"\n",
"cuda_spec = \"cu121\"\n",
"wheels_index = \"https://abetlen.github.io/llama-cpp-python/whl/\" + cuda_spec\n",
"try:\n",
"    # Try GPU wheel first\n",
"    exitcode = subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
"        f\"--extra-index-url={wheels_index}\", \"llama-cpp-python>=0.2.90\",\n",
"        \"huggingface_hub>=0.23.0\", \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
"    if exitcode != 0:\n",
"        raise RuntimeError(\"GPU wheel failed, falling back to CPU wheel\")\n",
"except Exception as e:\n",
"    print(\"Falling back to CPU wheel:\", e)\n",
"    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
"        \"llama-cpp-python>=0.2.90\", \"huggingface_hub>=0.23.0\",\n",
"        \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
"\n",
"print(\"✅ Installation complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f44766ae",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔐 (Optional) Hugging Face login\n",
"# Enter your HF token if the repo is gated (skip if public)\n",
"HF_TOKEN = \"\" #@param {type:\"string\"}\n",
"from huggingface_hub import login\n",
"if HF_TOKEN:\n",
"    login(HF_TOKEN, add_to_git_credential=True)\n",
"else:\n",
"    print(\"Skipping login (public repos should work without a token)\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7008eb8",
"metadata": {},
"outputs": [],
"source": [
"#@title 📦 Choose model & file (GGUF)\n",
"from huggingface_hub import hf_hub_download\n",
"\n",
"REPO_ID = \"TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF\" #@param [\"TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF\"] {allow-input: true}\n",
"FILENAME = \"tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf\" #@param [\"tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf\"] {allow-input: true}\n",
"\n",
"print(\"Downloading:\", REPO_ID, FILENAME)\n",
"model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=\"models\", local_dir_use_symlinks=False)\n",
"print(\"Saved to:\", model_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b49d39a5",
"metadata": {},
"outputs": [],
"source": [
"#@title ⚙️ Load model (GPU if available)\n",
"from llama_cpp import Llama\n",
"\n",
"# Context window in tokens; 4096 is a safe default for this model\n",
"n_ctx = 4096\n",
"llm = Llama(\n",
"    model_path=model_path,\n",
"    n_gpu_layers=-1,  # Offload all layers to GPU if available, otherwise CPU\n",
"    n_ctx=n_ctx,\n",
"    logits_all=False,\n",
"    verbose=False,\n",
")\n",
"print(\"✅ Model loaded\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2aac08c3",
"metadata": {},
"outputs": [],
"source": [
"#@title 🗣️ Chat (single turn)\n",
"system_prompt = \"You are a helpful, polite assistant.\"\n",
"user_prompt = \"Explain transformers in simple terms.\" #@param {type:\"string\"}\n",
"max_tokens = 512 #@param {type:\"slider\", min:32, max:2048, step:32}\n",
"temperature = 0.7 #@param {type:\"number\"}\n",
"\n",
"prompt = f\"<|system|>\\n{system_prompt}\\n<|user|>\\n{user_prompt}\\n<|assistant|>\\n\"\n",
"out = llm(\n",
"    prompt,\n",
"    max_tokens=max_tokens,\n",
"    temperature=temperature,\n",
"    stop=[\"<|user|>\", \"<|system|>\", \"</s>\"]\n",
")\n",
"print(out[\"choices\"][0][\"text\"].strip())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7494f779",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔁 Chat Loop (enter queries in the textbox)\n",
"import ipywidgets as widgets\n",
"from IPython.display import display, Markdown\n",
"\n",
"system_prompt = widgets.Textarea(\n",
"    value=\"You are a helpful, polite assistant.\",\n",
"    description='System:',\n",
"    layout=widgets.Layout(width='100%', height='80px')\n",
")\n",
"\n",
"user_box = widgets.Textarea(\n",
"    value=\"Write a Python function to check prime numbers.\",\n",
"    description='User:',\n",
"    layout=widgets.Layout(width='100%', height='100px')\n",
")\n",
"\n",
"temperature = widgets.FloatSlider(value=0.7, min=0.0, max=1.5, step=0.05, description='Temp')\n",
"max_tokens = widgets.IntSlider(value=512, min=32, max=2048, step=32, description='Max tokens')\n",
"\n",
"run_button = widgets.Button(description=\"Generate\", button_style='success')\n",
"\n",
"out_area = widgets.Output()\n",
"\n",
"def on_click(_):\n",
"    with out_area:\n",
"        out_area.clear_output()\n",
"        prompt = f\"<|system|>\\n{system_prompt.value}\\n<|user|>\\n{user_box.value}\\n<|assistant|>\\n\"\n",
"        result = llm(prompt, max_tokens=max_tokens.value, temperature=temperature.value, stop=[\"<|user|>\", \"<|system|>\", \"</s>\"])\n",
"        display(Markdown(result[\"choices\"][0][\"text\"].strip()))\n",
"\n",
"run_button.on_click(on_click)\n",
"\n",
"ui = widgets.VBox([system_prompt, user_box, temperature, max_tokens, run_button, out_area])\n",
"display(ui)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dda17cdf",
"metadata": {},
"outputs": [],
"source": [
"#@title 🌐 Optional: Start a local API server (OpenAI-compatible-ish)\n",
"# After running, you can access http://127.0.0.1:8000/docs inside Colab.\n",
"import threading\n",
"from llama_cpp.server.app import create_app\n",
"from fastapi.middleware.cors import CORSMiddleware\n",
"import uvicorn\n",
"\n",
"app = create_app(llm)\n",
"app.add_middleware(\n",
"    CORSMiddleware,\n",
"    allow_origins=[\"*\"],\n",
"    allow_credentials=True,\n",
"    allow_methods=[\"*\"],\n",
"    allow_headers=[\"*\"],\n",
")\n",
"\n",
"def run_server():\n",
"    uvicorn.run(app, host=\"0.0.0.0\", port=8000, log_level=\"info\")\n",
"\n",
"thread = threading.Thread(target=run_server, daemon=True)\n",
"thread.start()\n",
"print(\"Server starting on http://127.0.0.1:8000\")"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
co-lab/zephyr_7b_beta_gguf,_q4_k_m.ipynb ADDED
@@ -0,0 +1,217 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "aec8642b",
"metadata": {},
"source": [
"# Zephyr 7B Beta (GGUF, Q4_K_M)\n",
"**One-click Colab notebook** to run `TheBloke/zephyr-7B-beta-GGUF` (file: `zephyr-7b-beta.Q4_K_M.gguf`) using **llama-cpp-python** on GPU (T4/A100) or CPU.\n",
"\n",
"**Features**: Hugging Face login, GGUF download, fast GPU inference, chat UI cell, optional API server.\n",
"\n",
"\n",
"---\n",
"**Tip:** If you're on free Colab, switch to a GPU runtime: **Runtime → Change runtime type → T4 GPU**.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ecdac0c",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔧 Check GPU and Python version\n",
"!nvidia-smi || echo \"No NVIDIA GPU available\"\n",
"!python --version"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f7b92f8",
"metadata": {},
"outputs": [],
"source": [
"#@title ⬇️ Install dependencies (GPU build if available)\n",
"# If you get build errors, re-run this cell.\n",
"import os, sys, subprocess\n",
"\n",
"cuda_spec = \"cu121\"\n",
"wheels_index = \"https://abetlen.github.io/llama-cpp-python/whl/\" + cuda_spec\n",
"try:\n",
"    # Try GPU wheel first\n",
"    exitcode = subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
"        f\"--extra-index-url={wheels_index}\", \"llama-cpp-python>=0.2.90\",\n",
"        \"huggingface_hub>=0.23.0\", \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
"    if exitcode != 0:\n",
"        raise RuntimeError(\"GPU wheel failed, falling back to CPU wheel\")\n",
"except Exception as e:\n",
"    print(\"Falling back to CPU wheel:\", e)\n",
"    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-qU\",\n",
"        \"llama-cpp-python>=0.2.90\", \"huggingface_hub>=0.23.0\",\n",
"        \"ipywidgets\", \"pydantic<3\", \"uvicorn\", \"fastapi\"])\n",
"\n",
"print(\"✅ Installation complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "141768d1",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔐 (Optional) Hugging Face login\n",
"# Enter your HF token if the repo is gated (skip if public)\n",
"HF_TOKEN = \"\" #@param {type:\"string\"}\n",
"from huggingface_hub import login\n",
"if HF_TOKEN:\n",
"    login(HF_TOKEN, add_to_git_credential=True)\n",
"else:\n",
"    print(\"Skipping login (public repos should work without a token)\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "362ebcb2",
"metadata": {},
"outputs": [],
"source": [
"#@title 📦 Choose model & file (GGUF)\n",
"from huggingface_hub import hf_hub_download\n",
"\n",
"REPO_ID = \"TheBloke/zephyr-7B-beta-GGUF\" #@param [\"TheBloke/zephyr-7B-beta-GGUF\"] {allow-input: true}\n",
"FILENAME = \"zephyr-7b-beta.Q4_K_M.gguf\" #@param [\"zephyr-7b-beta.Q4_K_M.gguf\"] {allow-input: true}\n",
"\n",
"print(\"Downloading:\", REPO_ID, FILENAME)\n",
"model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=\"models\", local_dir_use_symlinks=False)\n",
"print(\"Saved to:\", model_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71308945",
"metadata": {},
"outputs": [],
"source": [
"#@title ⚙️ Load model (GPU if available)\n",
"from llama_cpp import Llama\n",
"\n",
"# Context window in tokens; 4096 is a safe default for this model\n",
"n_ctx = 4096\n",
"llm = Llama(\n",
"    model_path=model_path,\n",
"    n_gpu_layers=-1,  # Offload all layers to GPU if available, otherwise CPU\n",
"    n_ctx=n_ctx,\n",
"    logits_all=False,\n",
"    verbose=False,\n",
")\n",
"print(\"✅ Model loaded\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "887f05da",
"metadata": {},
"outputs": [],
"source": [
"#@title 🗣️ Chat (single turn)\n",
"system_prompt = \"You are a helpful, polite assistant.\"\n",
"user_prompt = \"Explain transformers in simple terms.\" #@param {type:\"string\"}\n",
"max_tokens = 512 #@param {type:\"slider\", min:32, max:2048, step:32}\n",
"temperature = 0.7 #@param {type:\"number\"}\n",
"\n",
"prompt = f\"<|system|>\\n{system_prompt}\\n<|user|>\\n{user_prompt}\\n<|assistant|>\\n\"\n",
"out = llm(\n",
"    prompt,\n",
"    max_tokens=max_tokens,\n",
"    temperature=temperature,\n",
"    stop=[\"<|user|>\", \"<|system|>\", \"</s>\"]\n",
")\n",
"print(out[\"choices\"][0][\"text\"].strip())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca60e43d",
"metadata": {},
"outputs": [],
"source": [
"#@title 🔁 Chat Loop (enter queries in the textbox)\n",
"import ipywidgets as widgets\n",
"from IPython.display import display, Markdown\n",
"\n",
"system_prompt = widgets.Textarea(\n",
"    value=\"You are a helpful, polite assistant.\",\n",
"    description='System:',\n",
"    layout=widgets.Layout(width='100%', height='80px')\n",
")\n",
"\n",
"user_box = widgets.Textarea(\n",
"    value=\"Write a Python function to check prime numbers.\",\n",
"    description='User:',\n",
"    layout=widgets.Layout(width='100%', height='100px')\n",
")\n",
"\n",
"temperature = widgets.FloatSlider(value=0.7, min=0.0, max=1.5, step=0.05, description='Temp')\n",
"max_tokens = widgets.IntSlider(value=512, min=32, max=2048, step=32, description='Max tokens')\n",
"\n",
"run_button = widgets.Button(description=\"Generate\", button_style='success')\n",
"\n",
"out_area = widgets.Output()\n",
"\n",
"def on_click(_):\n",
"    with out_area:\n",
"        out_area.clear_output()\n",
"        prompt = f\"<|system|>\\n{system_prompt.value}\\n<|user|>\\n{user_box.value}\\n<|assistant|>\\n\"\n",
"        result = llm(prompt, max_tokens=max_tokens.value, temperature=temperature.value, stop=[\"<|user|>\", \"<|system|>\", \"</s>\"])\n",
"        display(Markdown(result[\"choices\"][0][\"text\"].strip()))\n",
"\n",
"run_button.on_click(on_click)\n",
"\n",
"ui = widgets.VBox([system_prompt, user_box, temperature, max_tokens, run_button, out_area])\n",
"display(ui)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f832998a",
"metadata": {},
"outputs": [],
"source": [
"#@title 🌐 Optional: Start a local API server (OpenAI-compatible-ish)\n",
"# After running, you can access http://127.0.0.1:8000/docs inside Colab.\n",
"import threading\n",
"from llama_cpp.server.app import create_app\n",
"from fastapi.middleware.cors import CORSMiddleware\n",
"import uvicorn\n",
"\n",
"app = create_app(llm)\n",
"app.add_middleware(\n",
"    CORSMiddleware,\n",
"    allow_origins=[\"*\"],\n",
"    allow_credentials=True,\n",
"    allow_methods=[\"*\"],\n",
"    allow_headers=[\"*\"],\n",
")\n",
"\n",
"def run_server():\n",
"    uvicorn.run(app, host=\"0.0.0.0\", port=8000, log_level=\"info\")\n",
"\n",
"thread = threading.Thread(target=run_server, daemon=True)\n",
"thread.start()\n",
"print(\"Server starting on http://127.0.0.1:8000\")"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}