Tags: parameters guide, samplers guide, model generation, role play settings, quant selection, arm quants, iq quants vs q quants, optimal model settings, gibberish fixes, coherence, instruction following, quality generation, chat settings, quality settings, llamacpp server, llamacpp, lmstudio, sillytavern, koboldcpp, backyard, ollama, model generation steering, steering, model generation fixes, text generation webui, ggufs, exl2, full precision, quants, imatrix, neo imatrix, llama, llama-3, gemma, gemma2, gemma3, llama-2, llama-3.1, llama-3.2, mistral, mixture of experts, mixtral
Update README.md

README.md (CHANGED):
@@ -21,7 +21,9 @@ These settings can also fix a number of model issues, such as:
 - "Gibberish"
 - letter, word, phrase, and paragraph repeats
 - coherence
+- instruction following
 - creativeness, or lack thereof, or... too much ("purple prose")
+- low quant (i.e. q2k, iq1s, iq2s) issues
 
 Likewise, these settings can also improve model generation and/or the general overall "smoothness" / "quality" of model operation.
@@ -87,13 +89,13 @@ Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMA…
 
 Other programs like https://www.LMStudio.ai allow access to most of the STANDARD samplers, whereas with others (llamacpp only here) you may need to add settings to the json file(s) for a model and/or template preset.
 
-In most cases all llama_cpp settings are available when using API / headless / server mode in "text-generation-webui", "koboldcpp" and "lmstudio" (as well as other apps too).
+In most cases all llama_cpp settings are available when using API / headless / server mode in "text-generation-webui", "koboldcpp", "Ollama" and "lmstudio" (as well as other apps too).
 
 You can also use llama_cpp directly (IE: llama-server.exe); see:
 
 https://github.com/ggerganov/llama.cpp
 
-(scroll down on the main page for more apps/programs to use GGUFs too)
+(scroll down on the main page for more apps/programs that use GGUFs and connect to / use the LLAMA-CPP package.)
 
 ---
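As a minimal sketch of what "using llama_cpp directly" can look like in practice: assuming a local llama-server instance on the default port (8080), the parameters and samplers discussed throughout this document can be sent per-request as JSON fields on the /completion endpoint. The field names below follow the llama.cpp server README; verify them against your build, since they change between versions, and the prompt and values here are placeholders.

```python
import json
import urllib.request

# Assumes llama-server is already running locally, e.g.:
#   llama-server -m your-model.gguf --port 8080
SERVER = "http://127.0.0.1:8080/completion"

def complete(payload: dict) -> str:
    """POST one completion request to a local llama-server and return the text."""
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

print(complete({
    "prompt": "Write a short scene set in a lighthouse during a storm.",
    "n_predict": 400,        # max new tokens
    "temperature": 0.8,      # the most powerful primary parameter
    "top_k": 40,
    "top_p": 0.95,
    "repeat_penalty": 1.05,  # small increments are best: 1.01 ... 1.15
    "repeat_last_n": 64,     # penalty range; see PENALTY SAMPLERS below
}))
```

The later sketches in this document reuse this complete() helper rather than repeating the boilerplate.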
@@ -112,7 +114,7 @@ for chat / role play and/or other use case(s). Generally speaking, this helps th…
 Class 4 models are balanced on the very edge of stability. These models are generally highly creative, for very narrow use case(s), and closer to "human prose" than other models. With these models, advanced samplers
 are used to "bring these bad boys" in line, which is especially important for chat and/or role play type use cases AND/OR use case(s) these models were not designed for.
 
-The goal here is to use parameters to raise/lower the power of the model and samplers to "prune" (or in some cases enhance) operation.
+The goal here is to use parameters to raise/lower the power of the model, and samplers to "prune" (and/or in some cases enhance) operation.
 
 With that being said, generation "examples" (at my repo) are created using the "Primary Testing Parameters" (top of this document) settings regardless of the "class" of the model, AND NO advanced settings or samplers.
@@ -131,6 +133,8 @@ Generally it is recommended to run the highest quant(s) you can on your machine…
 
 The smaller the model, the greater the contrast between its smallest and largest quants in terms of operation, quality, nuance and general overall function.
 
+IMATRIX:
+
 Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) while retaining quality of operation.
 
 IE: Instead of using a Q4_K_M, you might be able to run an IQ3_M and get close to Q4_K_M's quality, but at a higher tokens-per-second speed and with more VRAM left for context.
@@ -142,8 +146,9 @@ PRIMARY PARAMETERS:
 
 These parameters will have a SIGNIFICANT effect on prose, generation, length and content, with temp being the most powerful.
 
 Keep in mind the biggest parameter / random "unknown" is your prompt.
 
+A word change, a rephrasing, or a punctuation change - even a comma or semi-colon - can drastically alter the output, even at min temp settings. CAPS also affect generation.
 
 <B>temp / temperature</B>
@@ -183,7 +188,10 @@ Bring this up to 80-120 for a lot more word choice, and below 40 for simpler wor…
 
 NOTES:
 
 For an interesting test, set "temp" to 0; this will give you the SAME generation for a given prompt each time.
 
+Then adjust a word, phrase, or sentence to see the differences.
 
 Keep in mind this will show model operation at its LEAST powerful/creative level and should NOT be used to determine if the model works for your use case(s).
 
 Then test "at temp" to see the model in action. (5-10 generations recommended)
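The temp-0 test above is easy to script. A sketch, reusing the complete() helper from the earlier llama-server example (the prompts are placeholders): at temperature 0, decoding is greedy, so each prompt maps to one fixed output, and any difference between the two generations comes from the one-character prompt edit alone.

```python
# Reuses the complete() helper from the earlier llama-server sketch.
base   = "Describe the old house at the end of the street."
edited = "Describe the old house at the end of the street!"  # one character changed

# temperature 0 = greedy decoding: the same prompt always yields the
# same output, so the comparison below isolates the effect of the edit.
for prompt in (base, edited):
    print(complete({"prompt": prompt, "n_predict": 200, "temperature": 0.0}))
    print("-" * 60)
```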
@@ -199,7 +207,7 @@ Then test "at temp" to see the MODELS in action. (5-10 generations recommended)
 PENALTY SAMPLERS:
 ------------------------------------------------------------------------------
 
-These samplers "trim" or "prune" output.
+These samplers "trim" or "prune" output in real time. The longer the generation, the stronger the overall effect.
 
 PRIMARY:
@@ -208,7 +216,11 @@ PRIMARY:
 last n tokens to consider for penalization (default: 64, 0 = disabled, -1 = ctx_size)
 ("repetition_penalty_range" in oobabooga/text-generation-webui, "rp_range" in kobold)
 
 THIS IS CRITICAL.
 
+Set it too high and you can get all kinds of issues (repeated words, sentences, paragraphs, or "gibberish"), especially with class 3 or 4 models.
+
+Likewise, changing this parameter will drastically alter the output.
 
 This setting also works in conjunction with all other "rep pens" below.
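One practical way to find the right range for a given model is a small sweep (a sketch, again reusing the complete() helper from the first example; the prompt and values are illustrative, not recommendations from this guide):

```python
# Sweep repeat_last_n with repeat_penalty held fixed, and watch where
# repeats or "gibberish" start to appear for your model and prompt.
for rep_range in (64, 128, 512, 1024):
    text = complete({
        "prompt": "Continue this story: The rain would not stop.",
        "n_predict": 300,
        "temperature": 0.8,
        "repeat_penalty": 1.05,
        "repeat_last_n": rep_range,
    })
    print(f"--- repeat_last_n = {rep_range} ---\n{text[:300]}\n")
```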
@@ -223,18 +235,18 @@ penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
 
 Generally this is set from 1.0 to 1.15; the smallest increments are best, IE: 1.01... 1.02, or even 1.001... 1.002.
 
-This affects creativity of the model over all
+This affects the creativity of the model overall, not just how words are penalized.
 
 
 <B>presence-penalty</B>
 
 repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
 
-Generally leave this at zero IF repeat-last-n is
+Generally leave this at zero IF repeat-last-n is 512-1024 or less. You may want to use this for higher repeat-last-n settings.
 
-CLASS 3: 0.05 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
+CLASS 3: 0.05 to 0.2 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
 
-CLASS 4: 0.1 to 0.
+CLASS 4: 0.1 to 0.35 may assist generation BUT SET "repeat-last-n" to 64.
 
 
 <B>frequency-penalty</B>
@@ -245,7 +257,7 @@ Generally leave this at zero IF repeat-last-n is 512 or less. You may want to us…
 
 CLASS 3: 0.25 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
 
-CLASS 4: 0.
+CLASS 4: 0.4 to 0.8 may assist generation BUT SET "repeat-last-n" to 64.
 
 
 <B>penalize-nl</B>
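Taken together, the CLASS 3 / CLASS 4 starting points above can be kept as reusable request presets (a sketch reusing the complete() helper; "class" is this guide's labeling rather than a llama.cpp concept, and the exact values are mid-range picks from the ranges given):

```python
# Mid-range picks from the CLASS 3 / CLASS 4 suggestions above.
# "Class" is this document's convention, not a llama.cpp setting.
CLASS3_PENALTIES = {
    "repeat_last_n": 128,      # 512 or less; 128 or 64 is better
    "presence_penalty": 0.1,   # 0.05 to 0.2
    "frequency_penalty": 0.25,
}
CLASS4_PENALTIES = {
    "repeat_last_n": 64,       # class 4 wants the tightest range
    "presence_penalty": 0.2,   # 0.1 to 0.35
    "frequency_penalty": 0.6,  # 0.4 to 0.8
}

print(complete({
    "prompt": "Continue this story: The rain would not stop.",
    "n_predict": 300,
    "temperature": 0.8,
    **CLASS4_PENALTIES,
}))
```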
@@ -295,14 +307,17 @@ mirostat_tau: 5-8 is a good value.
 mirostat_eta: 0.1 is a good value.
 
 
-This is the big one ; activating this will help with creative generation. It can also help with stability.
+This is the big one; activating it will help with creative generation. It can also help with stability. Also note which
+samplers are disabled/ignored here, and that "mirostat_eta" is a learning rate.
 
 This is both a sampler (and pruner) and an enhancement, all in one.
 
+It also has two modes of generation, "1" and "2" - test both with 5-10 generations of the same prompt. Make adjustments, and repeat.
+
 
 For Class 3 models it is suggested to use this to assist with generation (min settings).
 
-For Class 4 models it is highly recommended with Microstat 1 or 2 +
+For Class 4 models it is highly recommended, with Mirostat 1 or 2, mirostat_tau @ 6 to 8, and mirostat_eta at 0.1 to 0.5.
 
 
 <B>dynatemp-range</B>
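As a sketch of enabling Mirostat per-request against llama-server (field names per the llama.cpp server README at the time of writing - verify against your build; reusing the complete() helper from the first example):

```python
# Mirostat v2 with the Class 4 starting points suggested above.
# When mirostat is active, llama.cpp ignores truncation samplers such
# as top_k / top_p; mirostat_eta is its learning rate.
print(complete({
    "prompt": "Continue this story: The rain would not stop.",
    "n_predict": 300,
    "temperature": 0.8,
    "mirostat": 2,        # 0 = off, 1 = Mirostat, 2 = Mirostat 2.0
    "mirostat_tau": 6.0,  # target entropy; 6-8 suggested for Class 4
    "mirostat_eta": 0.1,  # learning rate; 0.1-0.5 suggested
}))
```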
@@ -466,3 +481,4 @@ Smaller quants may require STRONGER settings (all classes of models) due to comp…
 This is also influenced by the parameter size of the model in relation to the quant size.
 
 IE: an 8B model at Q2K will be far more unstable than a 20B model at Q2K, and as a result will require stronger settings.