Add imatrix computing tips
As a follow-up to https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-i1-GGUF/discussions/1 and, hopefully, as partial redemption for not having located the FAQ before, here are some extra imatrix computing tips that would be nice to add.
**README.md** CHANGED

```diff
@@ -142,6 +142,21 @@ and then run another command which handles download/computation/upload. Most of
 to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use,
 is unfortunately very frequent).
 
+## What do I need to do to compute imatrix files for large models?
+
+### Hardware
+
+* RAM: A lot of RAM is required to compute imatrix files. Example: 512 GB is just enough to compute 405B imatrix quants in Q8.
+* GPU: At least 8 GB of memory.
+
+### Extra tips
+
+* Computing 405B imatrix quants in Q8 does not seem to have any noticeable quality impact compared to BF16, so to save on hardware
+requirements, use Q8.
+* Sometimes, a single node may not have enough RAM to compute the imatrix file. In such cases, `llama-rpc` inside llama.cpp can
+be used to combine the RAM/VRAM of multiple nodes. This approach takes longer: computing the 405B imatrix file in BF16 takes
+around 20 hours using 3 nodes with 512 GB, 256 GB, and 128 GB of RAM, compared to 4 hours for Q8 on a single node.
+
 ## Why don't you use gguf-split?
 
 TL;DR: I don't have the hardware/resources for that.
```
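The multi-node RPC setup mentioned in the tips could be sketched roughly as below. This is a hedged sketch, not commands taken from the FAQ: it assumes llama.cpp built with the RPC backend enabled (`-DGGML_RPC=ON`), and the model file name, calibration file, hostnames, and port are placeholders.

```shell
# On each worker node: build llama.cpp with RPC support, then start an
# RPC server exposing that node's RAM/VRAM (50052 is an arbitrary port).
#   cmake -B build -DGGML_RPC=ON && cmake --build build
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the main node: compute the imatrix, listing the worker nodes with
# --rpc so their memory is pooled with the local node's. Using a Q8
# quant of the model roughly halves the footprint versus BF16.
./build/bin/llama-imatrix \
    -m Meta-Llama-3.1-405B.Q8_0.gguf \
    -f calibration-data.txt \
    -o imatrix.dat \
    --rpc node2:50052,node3:50052
```

As the tips note, this pooled setup is much slower than a single node with enough RAM, so it is worth trying the Q8 quant on one node first.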