Commit 3516ea8
Parent(s): bf6941f
Add sampling settings, KV cache benchmarks, and temp warning
README.md
CHANGED
@@ -47,6 +47,32 @@ All major weight matrices are adapted:
 Final training loss: ~0.94 (average: 1.268), decreasing steadily over training.

+## Recommended Sampling Settings
+
+These settings were validated through testing with [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) and [Kronk](https://github.com/danielcherubini/kronk) on an RTX 3080 10GB.
+
+| Profile | temperature | top_k | top_p | min_p | presence_penalty |
+|---------|-------------|-------|-------|-------|-----------------|
+| **Coding** | 0.6 | 20 | 0.95 | 0.0 | 0.0 |
+| **Chat** | 1.0 | 20 | 0.95 | 0.0 | 1.5 |
+
+> [!WARNING]
+> **Do not use temperature below 0.5.** Low temperatures (e.g., 0.3) cause deterministic looping in multi-turn agentic use, where the model repeats the same tool call indefinitely.
+
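The two profiles and the temperature floor can be captured in a small helper so that client code cannot accidentally request a temperature below 0.5. The following is a minimal sketch, not part of the commit: `SamplingProfile`, `PROFILES`, and `build_request` are hypothetical names, and the request body is assumed to use the same field names as the table columns. Check the field names your inference server actually accepts before sending it.

```python
from dataclasses import asdict, dataclass

# Floor taken from the warning in the README: lower values risk tool-call loops.
MIN_SAFE_TEMPERATURE = 0.5


@dataclass(frozen=True)
class SamplingProfile:
    temperature: float
    top_k: int
    top_p: float
    min_p: float
    presence_penalty: float


# Values copied from the "Recommended Sampling Settings" table.
PROFILES = {
    "coding": SamplingProfile(0.6, 20, 0.95, 0.0, 0.0),
    "chat": SamplingProfile(1.0, 20, 0.95, 0.0, 1.5),
}


def build_request(prompt: str, profile: str = "coding", **overrides) -> dict:
    """Return a JSON-serializable request body (hypothetical helper).

    Keyword overrides replace individual profile values; any temperature
    below the documented floor is rejected instead of silently accepted.
    """
    params = {**asdict(PROFILES[profile]), **overrides}
    if params["temperature"] < MIN_SAFE_TEMPERATURE:
        raise ValueError(
            f"temperature {params['temperature']} is below the recommended "
            f"minimum of {MIN_SAFE_TEMPERATURE}"
        )
    return {"prompt": prompt, **params}
```

Calling `build_request("Refactor this function", "coding")` yields a dict with the Coding row's values, while `build_request("hi", temperature=0.3)` raises `ValueError`.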
+### KV Cache Quantization
+
+For VRAM-constrained GPUs, quantize the KV cache. The configurations below keep keys in f16 and quantize values to q4_0:
+
+| Context Length (tokens) | KV Cache (K/V) | VRAM (Q4_K_M) | Generation Speed |
+|-------------------------|----------------|---------------|------------------|
+| 102,400 | f16/q4_0 | ~8.5 GB | ~111 tok/s |
+| 131,072 | f16/q4_0 | ~9.1 GB | ~110 tok/s |
+
+```bash
+# llama.cpp / ik_llama.cpp flags: f16 keys (-ctk), q4_0 values (-ctv)
+-ctk f16 -ctv q4_0
+```
+
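To see why quantizing only the values already helps, it is useful to estimate the cache size directly. The sketch below is a rough back-of-the-envelope calculation, not a measurement: the model dimensions (`n_layers`, `n_kv_heads`, `head_dim`) are hypothetical placeholders, not this model's real configuration, and the estimate ignores weights, compute buffers, and any layers that keep no KV cache. The per-element sizes follow the ggml block layouts, where q4_0 and q8_0 store 32 values plus one 2-byte f16 scale per block.

```python
# Bytes per cached element for a few ggml cache types.
# q4_0: 16 bytes of 4-bit values + 2-byte scale per 32 elements.
# q8_0: 32 bytes of 8-bit values + 2-byte scale per 32 elements.
BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes for a standard attention stack."""
    # Keys and values each hold one element per token, layer, KV head, and dim.
    elements = n_ctx * n_layers * n_kv_heads * head_dim
    return elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])


if __name__ == "__main__":
    # Hypothetical dimensions for illustration only; read the real values
    # from the model's config or GGUF metadata.
    dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for n_ctx in (102_400, 131_072):
        full = kv_cache_bytes(n_ctx, **dims)
        mixed = kv_cache_bytes(n_ctx, **dims, type_k="f16", type_v="q4_0")
        print(f"{n_ctx:>7} tokens: f16/f16 {full / 2**30:.2f} GiB, "
              f"f16/q4_0 {mixed / 2**30:.2f} GiB")
```

Under this formula, an f16/q4_0 cache takes about 64% of the space of an f16/f16 cache regardless of model size. For comparison, the two rows of the table differ by roughly 0.6 GB over 28,672 tokens, which is on the order of 20 KB per token, although the rounded figures make that only a coarse estimate.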
 ## Usage

 ### With PEFT