danielcherubini committed on
Commit 3516ea8 · 1 Parent(s): bf6941f

Add sampling settings, KV cache benchmarks, and temp warning

Files changed (1): README.md (+26 −0)

README.md CHANGED
@@ -47,6 +47,32 @@ All major weight matrices are adapted:
 
 
 Final training loss: ~0.94 (average: 1.268), decreasing steadily over training.
 
+ ## Recommended Sampling Settings
+
+ These settings were validated through testing with [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) and [Kronk](https://github.com/danielcherubini/kronk) on an RTX 3080 10GB.
+
+ | Profile | temperature | top_k | top_p | min_p | presence_penalty |
+ |---------|-------------|-------|-------|-------|------------------|
+ | **Coding** | 0.6 | 20 | 0.95 | 0.0 | 0.0 |
+ | **Chat** | 1.0 | 20 | 0.95 | 0.0 | 1.5 |
+
+ > [!WARNING]
+ > **Do not use temperature below 0.5** — low temperatures (e.g., 0.3) cause deterministic looping in multi-turn agentic use, where the model repeats the same tool call indefinitely.
+
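+ As a worked example, the Coding profile maps onto llama.cpp-style sampler flags roughly as follows (a sketch only; the model path is a placeholder, and exact flag spellings can vary between llama.cpp builds):
+
+ ```bash
+ # Coding profile: temperature 0.6, top_k 20, top_p 0.95, min_p 0.0
+ llama-server -m ./model.gguf \
+   --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0
+ # Chat profile instead uses --temp 1.0 and adds --presence-penalty 1.5
+ ```
+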
+ ### KV Cache Quantization
+
+ For VRAM-constrained GPUs, keep the K cache at f16 and quantize the V cache to q4_0:
+
+ | Context Length | KV Cache (K/V) | VRAM (Q4_K_M) | Generation Speed |
+ |----------------|----------------|---------------|------------------|
+ | 102,400 | f16/q4_0 | ~8.5 GB | ~111 tok/s |
+ | 131,072 | f16/q4_0 | ~9.1 GB | ~110 tok/s |
+
+ ```bash
+ # llama.cpp / ik_llama.cpp flags: f16 keys, q4_0 values
+ -ctk f16 -ctv q4_0
+ ```
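+
+ Combined with the 131,072-token row above, a full launch might look like this (a sketch; the model filename is a placeholder, and note that on mainline llama.cpp a quantized V cache typically also requires flash attention to be enabled):
+
+ ```bash
+ # 131,072-token context with f16 keys and q4_0 values
+ llama-server -m ./model-Q4_K_M.gguf \
+   -c 131072 -ctk f16 -ctv q4_0 -fa
+ ```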
+
 ## Usage
 
 ### With PEFT