Update README.md
README.md CHANGED
@@ -21,7 +21,7 @@ language:
- Context length has a big impact on your memory usage. Let's say I have a 16GB VRAM card; I can run a model in two ways using Text-Generation-WebUI:
1. Inference: download the original model and launch with the args ``--load-in-4bit --use_double_quant``. This way I can run every model on my leaderboard. The bigger the parameter count, the slower tokens are generated (e.g. a 7B model can run at ~15 tokens/s, while a 3x7B model only manages ~4-5 tokens/s). A rough equivalent in plain transformers is sketched after this list.
2. GGUF quantization (the fastest, cheapest way to run): after downloading the GGUF version of these models, you sometimes can't run one even though you can run another model with more parameters. That's because:
- The context length: a 16GB VRAM GPU can run at most a 2x10.7B (~19.2B) model at 4k context length. HyouKan is 3x7B (~18.5B) parameters, but has an 8k (or 32k) context length, which needs a lot more RAM/VRAM to load. (``--auto-devices`` may help you run the model, but I haven't verified it.) => A 7B model at 32k context uses roughly the same RAM/VRAM as a 13B model at 4k in the GGUF version; see the back-of-envelope sketch after this list.
- The model is bugged/broken.😏
- A bigger model holds more of the information you need for your Character Card.
- Best GGUF versions to run (balancing speed and quality): Q4_K_M or Q5_K_M (slower than Q4); a minimal loading example follows below.
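
For item 1, here is a minimal sketch of the same 4-bit double-quant setup done directly with transformers and bitsandbytes instead of Text-Generation-WebUI's flags. The repo id is a placeholder, not the actual model path:

```python
# Rough equivalent of Text-Generation-WebUI's --load-in-4bit --use_double_quant flags.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-name/HyouKan-3x7B"  # hypothetical repo id, replace with the real one

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,               # --load-in-4bit
    bnb_4bit_use_double_quant=True,  # --use_double_quant
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spill layers to CPU RAM if 16GB VRAM is not enough
)
```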
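
On the context-length point: a back-of-envelope estimate of weights plus fp16 KV cache shows why a 7B model at 32k lands in the same memory bracket as a 13B model at 4k. The layer/head counts below assume a Mistral-style 7B (GQA, 8 KV heads) and a Llama-2-style 13B (full attention), and the weight sizes are rough Q4_K_M figures, so treat the output as ballpark only:

```python
# Back-of-envelope sketch; real usage also includes the runtime's own buffers.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt=2):
    """fp16 KV cache: 2 (K and V) * layers * kv_heads * head_dim * context."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt / 1024**3

# ~4.1 GiB Q4_K_M weights for 7B, ~7.9 GiB for 13B (rough figures).
seven_b_32k = 4.1 + kv_cache_gib(32, 8, 128, 32_768)
thirteen_b_4k = 7.9 + kv_cache_gib(40, 40, 128, 4_096)

print(f"7B  @ 32k: ~{seven_b_32k:.1f} GiB")   # ~8.1 GiB
print(f"13B @ 4k : ~{thirteen_b_4k:.1f} GiB") # ~11.0 GiB
```

Both land in the same ~8-11 GiB bracket on a 16GB card, which is why the long-context 7B can fail to load where a short-context 13B runs fine.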
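
And a minimal llama-cpp-python sketch for loading one of the recommended GGUF quants; the file name is a placeholder, and ``n_ctx`` is the knob that drives the extra memory cost described above:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./HyouKan-3x7B.Q4_K_M.gguf",  # hypothetical file name
    n_ctx=8192,       # lower this (e.g. 4096) if you run out of memory
    n_gpu_layers=-1,  # offload all layers to the GPU; reduce to spill to RAM
)

out = llm("Hello, my name is", max_tokens=32)
print(out["choices"][0]["text"])
```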