NuisanceValue committed
Commit 2549a20 · verified · 1 Parent(s): a4a8867

Update README.md

Files changed (1)
  1. README.md +20 -6
README.md CHANGED
@@ -33,9 +33,9 @@ The following GGUF quantized variants of MetalGPT-1 are provided:
 
  | File name | Quantization | Size (GB) | Notes |
  | :------------------------- | :----------- | :-------- | :------------------------------------------------------------- |
- | `MetalGPT-1-32B-Q8_0.gguf` | Q8_0 | 34.8 | High quality, high VRAM |
- | `MetalGPT-1-32B-Q6_K.gguf` | Q6_K | 26.9 | High quality, but less than Q8_0, less VRAM |
- | `MetalGPT-1-32B-Q4_K_M.gguf` | Q4_K_M | 19.8 | Good quality, memoryefficient |
+ | `MetalGPT-1-32B-Q8_0.gguf` | Q8_0 | 34.8 | Best quality among these quants; requires more VRAM |
+ | `MetalGPT-1-32B-Q6_K.gguf` | Q6_K | 26.9 | High quality; lower VRAM usage than Q8_0 |
+ | `MetalGPT-1-32B-Q4_K_M.gguf` | Q4_K_M | 19.8 | Good quality; memory-efficient |
  | `MetalGPT-1-32B-Q4_K_S.gguf` | Q4_K_S | 18.8 | Slightly more aggressive quantization than Q4_K_M |
 
  Choose a variant based on your hardware and quality requirements:
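
The sizes in the updated table line up roughly with the nominal bits-per-weight of each llama.cpp quant type. A quick back-of-the-envelope check; the bits-per-weight figures below are approximate conventions, and the 32B parameter count is inferred from the file names rather than stated in this diff:

```python
# Rough sanity check: GGUF file size ~ parameters * bits-per-weight / 8.
# The bpw values are approximate for llama.cpp quant types, not exact for this model.
PARAMS = 32e9  # "32B" inferred from the file names

approx_bpw = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8, "Q4_K_S": 4.6}

for quant, bpw in approx_bpw.items():
    print(f"{quant}: ~{PARAMS * bpw / 8 / 1e9:.1f} GB")
# Q8_0: ~34.0 GB, Q6_K: ~26.4 GB, Q4_K_M: ~19.2 GB, Q4_K_S: ~18.4 GB
# (close to the 34.8 / 26.9 / 19.8 / 18.8 GB listed in the table)
```
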
@@ -43,7 +43,19 @@ Choose a variant based on your hardware and quality requirements:
 - **Q4_K_M / Q4_K_S**: best options for low‑VRAM environments.
 - **Q6_K / Q8_0**: better fidelity for demanding generation quality.
 
- *Note: Try adding the `/think` tag to your prompts if you want to explicitly trigger reasoning capabilities.*
+ > **Note:** Try adding the `/think` tag to your prompts if you want to explicitly trigger reasoning capabilities.
+
+ ### VRAM guidance
+
+ These numbers are rough rules of thumb for **32B** GGUF inference; actual VRAM/RAM usage depends on runtime/backend, context size (KV cache), and overhead.
+
+ - **< 24 GB VRAM**: you’ll likely need **partial GPU offload** (some weights/layers stay in system RAM). Prefer **Q4_K_M / Q4_K_S**.
+ - **~24 GB VRAM**: **Q4** variants typically fit best; higher quants may still require partial offload depending on context size.
+ - **~32 GB VRAM**: **Q6_K** is a reasonable target; may still require tuning/offload for large contexts.
+ - **40 GB+ VRAM**: **Q8_0** is usually the go-to “max fidelity quant” option among the listed files.
+ - **80 GB+ VRAM**: consider running the **original (non-quantized) weights** instead of quants if you want maximum fidelity.
+
+ > **Note:** **partial offload** (keeping some layers in system RAM) can significantly reduce throughput vs full GPU offload.
 
 ---
 
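
The rule-of-thumb numbers added above are essentially "quantized weight file + KV cache + overhead". A minimal sketch of that estimate; the layer/head/dimension values are placeholders for a typical ~32B dense model, since MetalGPT-1's actual architecture is not shown in this diff:

```python
# Rough VRAM budget = quantized weights + KV cache + runtime overhead.
# All architecture numbers below are PLACEHOLDERS, not MetalGPT-1's real config.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    # K and V tensors for every layer; fp16 (2 bytes) is the llama.cpp default.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3

weights_gib  = 19.8  # Q4_K_M file size from the table above
kv_gib       = kv_cache_gib(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=8192)
overhead_gib = 1.5   # compute buffers, CUDA context, etc. (very rough)

print(f"KV cache @ 8k ctx: ~{kv_gib:.1f} GiB")                                # ~2.0 GiB here
print(f"Total estimate:    ~{weights_gib + kv_gib + overhead_gib:.1f} GiB")   # ~23.3 GiB here
```

With these placeholder numbers, a Q4_K_M load at 8k context lands just under 24 GB, which is consistent with the guidance in the hunk above.
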
@@ -67,7 +79,7 @@ Choose a variant based on your hardware and quality requirements:
 ollama list
 ```
 
- *Note: You can also use Ollama through a web UI such as [OpenWebUI](https://github.com/open-webui/open-webui) by configuring it to connect to your Ollama server.*
+ > **Note:** You can also use Ollama through a web UI such as [OpenWebUI](https://github.com/open-webui/open-webui) by configuring it to connect to your Ollama server.
 
 
 ## Usage with `llama.cpp`
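
A UI like OpenWebUI simply talks to Ollama's local HTTP API, which can also be called directly. A minimal sketch, assuming Ollama's default port (11434) and a placeholder model tag (`metalgpt-1`); use whatever name `ollama list` actually reports:

```python
# Call the local Ollama server directly over its HTTP API (the same endpoint
# a web UI such as OpenWebUI would be configured to use).
import json
import urllib.request

payload = {
    "model": "metalgpt-1",  # placeholder tag; use the name shown by `ollama list`
    "prompt": "Summarize the trade-offs between Q4_K_M and Q8_0 quantization.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default address/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```
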
@@ -84,6 +96,8 @@ Download one of the GGUF files (for example `MetalGPT-1-32B-Q4_K_M.gguf`) and run:
   --ctx-size 8192
 ```
 
+ > **Tip (GPU offload):** you can add `-ngl N` (aka `--n-gpu-layers`) — it controls how many layers are offloaded to VRAM, while the rest stays in system RAM. Start with `-ngl -1` (try to offload all layers); if you hit an out-of-memory error, lower it (e.g., `-ngl 20`, `-ngl 30`, …) until it fits.
+
 
 ## Usage with `llama-cpp-python`
 
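
When `-ngl -1` triggers an out-of-memory error, you need a concrete layer count to fall back to. A rough way to pick one, assuming roughly equal-sized layers; `suggest_ngl` is a hypothetical helper and the 64-layer figure is a placeholder, since the model's real layer count is not stated in this diff (llama.cpp reports it in its load log):

```python
# Rough helper for picking an `-ngl` value when the full model doesn't fit:
# estimate how many layers fit in the VRAM you want to spend on weights.
# Layer count and sizes are placeholders, not exact figures for this model.
def suggest_ngl(vram_gib_for_weights, model_gib, n_layers):
    per_layer_gib = model_gib / n_layers  # assume layers are roughly equal in size
    return min(n_layers, int(vram_gib_for_weights / per_layer_gib))

# Example: Q6_K file (~26.9 GiB) on a 24 GiB card, reserving ~4 GiB for
# KV cache and overhead, with a placeholder layer count of 64.
print(suggest_ngl(vram_gib_for_weights=20, model_gib=26.9, n_layers=64))  # -> 47
```
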
@@ -102,7 +116,7 @@ model_path = "MetalGPT-1-32B-Q4_K_M.gguf"
 # Initialize the model
 llm = Llama(
     model_path=model_path,
-    n_gpu_layers=-1, # Offload all layers to GPU. If you get an OOM error, change this number (e.g., to 20 or 30).
+    n_gpu_layers=-1, # Offload all layers to GPU. If you get an OOM error, change this number to offload some layers to RAM (e.g., to 20 or 30).
     n_ctx=8192, # Context window (adjust based on VRAM)
     verbose=False
 )
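
The hunk above only touches the constructor comment. For context, a fuller, self-contained sketch built on the standard `llama-cpp-python` API; the import, the prompt, and the `create_chat_completion` call are assumptions, not lines visible in this diff, and it uses the `/think` tag mentioned earlier:

```python
# Self-contained sketch around the constructor shown in the diff.
# The import and the generation call are standard llama-cpp-python usage,
# not lines taken from this README.
from llama_cpp import Llama

model_path = "MetalGPT-1-32B-Q4_K_M.gguf"

llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,  # offload all layers; lower this (e.g. 20 or 30) on OOM
    n_ctx=8192,       # context window (adjust based on VRAM)
    verbose=False,
)

# Append `/think` to explicitly trigger reasoning, as suggested above.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why does Q4_K_M need less VRAM than Q8_0? /think"}],
    max_tokens=512,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```
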
 