Update README.md
README.md CHANGED
@@ -21,7 +21,7 @@ language:
- Context length has a big impact on your memory usage. Let's say I have a 16GB VRAM card; I can run a model in two ways using Text-Generation-WebUI:
1. Inference: download the original model and launch with the args ``--load-in-4bit --use_double_quant``. This way I can run every model on my leaderboard. The bigger the parameter count, the slower tokens are generated (e.g. a 7B model can run at ~15 tokens/s, while a 3x7B model only manages ~4-5 tokens/s). A rough equivalent in plain transformers is sketched after this list.
2. GGUF quantization (the fastest, cheapest way to run): after downloading the GGUF version of these models, you sometimes can't run one even though you can run another model with more parameters. That's because:
- The context length: a 16GB VRAM GPU can run at most a 2x10.7B (~19.2B) model at 4k context length. HyouKan is 3x7B (~18.5B) parameters, but has an 8k (or 32k) context length, which needs a lot more RAM/VRAM to load. (``--auto-devices`` may help you run the model, but I haven't verified it.) => A 7B model at 32k context uses roughly the same RAM/VRAM as a 13B model at 4k in the GGUF version; see the back-of-envelope sketch after this list.
- The model is bugged/broken.😏
- A bigger model holds more of the information you need for your Character Card.
- Best GGUF versions to run (balancing speed and quality): Q4_K_M or Q5_K_M (slower than Q4); a minimal loading example follows below.
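
For item 1, here is a minimal sketch of the same 4-bit double-quant setup done directly with transformers and bitsandbytes instead of Text-Generation-WebUI's flags. The repo id is a placeholder, not the actual model path:

```python
# Rough equivalent of Text-Generation-WebUI's --load-in-4bit --use_double_quant flags.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-name/HyouKan-3x7B"  # hypothetical repo id, replace with the real one

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,               # --load-in-4bit
    bnb_4bit_use_double_quant=True,  # --use_double_quant
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spill layers to CPU RAM if 16GB VRAM is not enough
)
```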
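
On the context-length point: a back-of-envelope estimate of weights plus fp16 KV cache shows why a 7B model at 32k lands in the same memory bracket as a 13B model at 4k. The layer/head counts below assume a Mistral-style 7B (GQA, 8 KV heads) and a Llama-2-style 13B (full attention), and the weight sizes are rough Q4_K_M figures, so treat the output as ballpark only:

```python
# Back-of-envelope sketch; real usage also includes the runtime's own buffers.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt=2):
    """fp16 KV cache: 2 (K and V) * layers * kv_heads * head_dim * context."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt / 1024**3

# ~4.1 GiB Q4_K_M weights for 7B, ~7.9 GiB for 13B (rough figures).
seven_b_32k = 4.1 + kv_cache_gib(32, 8, 128, 32_768)
thirteen_b_4k = 7.9 + kv_cache_gib(40, 40, 128, 4_096)

print(f"7B  @ 32k: ~{seven_b_32k:.1f} GiB")   # ~8.1 GiB
print(f"13B @ 4k : ~{thirteen_b_4k:.1f} GiB") # ~11.0 GiB
```

Both land in the same ~8-11 GiB bracket on a 16GB card, which is why the long-context 7B can fail to load where a short-context 13B runs fine.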
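
And a minimal llama-cpp-python sketch for loading one of the recommended GGUF quants; the file name is a placeholder, and ``n_ctx`` is the knob that drives the extra memory cost described above:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./HyouKan-3x7B.Q4_K_M.gguf",  # hypothetical file name
    n_ctx=8192,       # lower this (e.g. 4096) if you run out of memory
    n_gpu_layers=-1,  # offload all layers to the GPU; reduce to spill to RAM
)

out = llm("Hello, my name is", max_tokens=32)
print(out["choices"][0]["text"])
```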