Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -140,17 +140,32 @@ model/
|
|
| 140 |
README.md — This file
|
| 141 |
```
|
| 142 |
|
| 143 |
## Usage
|
| 144 |
|
| 145 |
```bash
|
| 146 |
-
#
|
| 147 |
-
qora.exe --load model.qora --prompt "What is
|
| 148 |
|
| 149 |
-
#
|
| 150 |
-
qora.exe --load model.qora --prompt "
|
| 151 |
|
| 152 |
-
#
|
| 153 |
-
qora.exe --load model.qora --prompt "
|
| 154 |
|
| 155 |
# Control output length
|
| 156 |
qora.exe --load model.qora --prompt "Tell me a story" --max-tokens 512
|
|
@@ -163,28 +178,40 @@ qora.exe --load model.qora --prompt "Once upon a time" --raw --max-tokens 128
|
|
| 163 |
|
| 164 |
| Flag | Default | Description |
|
| 165 |
|------|---------|-------------|
|
| 166 |
-
| `--load <path>` | — | Load from .qora binary (fast, ~
|
| 167 |
| `--model-path <path>` | `.` | Path to safetensors model directory |
|
| 168 |
| `--prompt <text>` | "Hello, how are you?" | Input prompt |
|
| 169 |
-
| `--max-tokens <n>` |
|
| 170 |
-
| `--no-think` | off | Disable
|
| 171 |
-
| `--greedy` | off | Greedy decoding (temperature=0) |
|
| 172 |
| `--raw` | off | Raw text completion (no chat template) |
|
| 173 |
| `--f16` | off | Use F16 weights instead of Q4 |
|
| 174 |
| `--save <path>` | — | Save model as .qora binary |
|
| 175 |
|
| 176 |
## Performance Benchmarks
|
| 177 |
|
| 178 |
**Test Hardware:** Windows 11, CPU-only (no GPU acceleration)
|
| 179 |
|
| 180 |
### Inference Speed
|
| 181 |
|
| 182 |
| Metric | Value |
|
| 183 |
|--------|-------|
|
| 184 |
-
| **Model Load (binary)** | ~
|
| 185 |
-
| **Prefill Speed** | ~
|
| 186 |
-
| **Decode Speed
|
| 187 |
-
| **Decode
|
| 188 |
| **Memory (Q4)** | 1,681 MB |
|
| 189 |
| **Memory (F16)** | ~6,000 MB |
|
| 190 |
|
|
@@ -465,7 +492,8 @@ QORA uses symmetric 4-bit quantization with group_size=32:
|
|
| 465 |
|-----------|---------|-------------|
|
| 466 |
| Temperature | 0.6 | Controls randomness (0 = greedy) |
|
| 467 |
| Top-P | 0.95 | Nucleus sampling threshold |
|
| 468 |
-
|
| 469 |
|
| 470 |
## QORA Model Family
|
| 471 |
|
|
|
|
| 140 |
README.md — This file
|
| 141 |
```
|
| 142 |
|
| 143 |
+
## Quick Start
|
| 144 |
+
|
| 145 |
+
For the **fastest results**, use `--no-think --greedy`:
|
| 146 |
+
|
| 147 |
+
```bash
|
| 148 |
+
.\qora.exe --load model.qora --prompt "What is X?" --no-think --greedy
|
| 149 |
+
```
|
| 150 |
+
|
| 151 |
+
This skips the thinking phase and uses deterministic decoding, so the model starts writing its answer right away.
|
| 152 |
+
|
| 153 |
+
> **Tip:** Think mode produces better answers for complex questions (math, coding, reasoning) but uses 100-300+ tokens just for thinking before the answer appears. On CPU this can take several minutes. For simple factual questions, `--no-think` is much faster.
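
To put rough numbers on the tip above, the wait before the first answer token can be estimated from the thinking-token count and the decode speed. This is only a back-of-the-envelope sketch: `estimate_wait_seconds` is a hypothetical helper, not part of QORA, and the default of about 1 tok/s is taken from the decode speed this README reports for CPU inference.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of QORA): estimate how many seconds the
# thinking phase takes before the answer starts.
#   $1 = number of thinking tokens
#   $2 = decode speed in milli-tokens per second (default 1000, i.e. ~1.0 tok/s)
# Using milli-tokens keeps everything in integer shell arithmetic.
estimate_wait_seconds() {
  local think_tokens=$1
  local millitok_per_sec=${2:-1000}
  echo $(( think_tokens * 1000 / millitok_per_sec ))
}

estimate_wait_seconds 100   # lower end of the 100-300+ token range
estimate_wait_seconds 300   # upper end of the 100-300+ token range
```

At the assumed ~1 tok/s this comes to about 100 to 300 seconds of thinking, which is where "several minutes" comes from. Actual times depend on the hardware and the prompt.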
|
| 154 |
+
|
| 155 |
## Usage
|
| 156 |
|
| 157 |
```bash
|
| 158 |
+
# Fastest: direct answer, no thinking, deterministic
|
| 159 |
+
qora.exe --load model.qora --prompt "What is the capital of France?" --no-think --greedy
|
| 160 |
+
|
| 161 |
+
# Fast: direct answer with some randomness
|
| 162 |
+
qora.exe --load model.qora --prompt "Tell me about Mars" --no-think
|
| 163 |
|
| 164 |
+
# Full quality: thinking mode (slower but better for complex questions)
|
| 165 |
+
qora.exe --load model.qora --prompt "Solve: if x^2 + 3x = 10, what is x?" --max-tokens 1024
|
| 166 |
|
| 167 |
+
# See what the model is thinking
|
| 168 |
+
qora.exe --load model.qora --prompt "What is 2+2?" --show-think
|
| 169 |
|
| 170 |
# Control output length
|
| 171 |
qora.exe --load model.qora --prompt "Tell me a story" --max-tokens 512
|
|
|
|
| 178 |
|
| 179 |
| Flag | Default | Description |
|
| 180 |
|------|---------|-------------|
|
| 181 |
+
| `--load <path>` | — | Load from .qora binary (fast, ~2-5s) |
|
| 182 |
| `--model-path <path>` | `.` | Path to safetensors model directory |
|
| 183 |
| `--prompt <text>` | "Hello, how are you?" | Input prompt |
|
| 184 |
+
| `--max-tokens <n>` | 1024 | Maximum tokens to generate |
|
| 185 |
+
| `--no-think` | off | Disable thinking mode (faster, direct answers) |
|
| 186 |
+
| `--greedy` | off | Greedy decoding (temperature=0, deterministic) |
|
| 187 |
+
| `--show-think` | off | Display thinking content on stderr |
|
| 188 |
| `--raw` | off | Raw text completion (no chat template) |
|
| 189 |
| `--f16` | off | Use F16 weights instead of Q4 |
|
| 190 |
| `--save <path>` | — | Save model as .qora binary |
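
Because `--show-think` writes the thinking content to stderr, ordinary shell redirection can keep it apart from the answer. The sketch below assumes the answer itself goes to stdout, which this README does not state explicitly. `split_streams` is a hypothetical wrapper, not a QORA feature, and the output file names are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of QORA): run a command and save its
# stdout and stderr to separate files.
#   $1 = file for stdout (assumed to carry the answer)
#   $2 = file for stderr (the thinking trace when --show-think is set)
#   remaining arguments = the command to run
split_streams() {
  local answer_file=$1
  local trace_file=$2
  shift 2
  "$@" >"$answer_file" 2>"$trace_file"
}

# Example (requires qora.exe and model.qora in the current directory):
# split_streams answer.txt thinking.txt \
#   qora.exe --load model.qora --prompt "What is 2+2?" --show-think
```

The same effect is available without a wrapper by appending `> answer.txt 2> thinking.txt` to the command line.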
|
| 191 |
|
| 192 |
+
### Speed Tips
|
| 193 |
+
|
| 194 |
+
| Mode | Speed | Best For |
|
| 195 |
+
|------|-------|----------|
|
| 196 |
+
| `--no-think --greedy` | ~1 tok/s | Fastest. Simple factual questions. |
|
| 197 |
+
| `--no-think` | ~1 tok/s | Fast with variety. General questions. |
|
| 198 |
+
| `--show-think` | ~1 tok/s | See reasoning. Complex questions. |
|
| 199 |
+
| *(default think mode)* | ~1 tok/s | Best quality, but thinking uses 100-300+ tokens before the answer appears. Use `--max-tokens 1024` or higher. |
|
| 200 |
+
|
| 201 |
## Performance Benchmarks
|
| 202 |
|
| 203 |
**Test Hardware:** Windows 11, CPU-only (no GPU acceleration)
|
| 204 |
|
| 205 |
### Inference Speed
|
| 206 |
|
| 207 |
+
Tested on i5-11500 (6C/12T, AVX-512), 16GB RAM, Windows 11.
|
| 208 |
+
|
| 209 |
| Metric | Value |
|
| 210 |
|--------|-------|
|
| 211 |
+
| **Model Load (binary)** | ~3-17s (varies with disk cache) |
|
| 212 |
+
| **Prefill Speed** | ~1.3-2.2 tok/s |
|
| 213 |
+
| **Decode Speed** | ~1.0 tok/s (Q4, multi-threaded GEMV) |
|
| 214 |
+
| **Single Decode Step** | ~530ms (warm benchmark) |
|
| 215 |
| **Memory (Q4)** | 1,681 MB |
|
| 216 |
| **Memory (F16)** | ~6,000 MB |
|
| 217 |
|
|
|
|
| 492 |
|-----------|---------|-------------|
|
| 493 |
| Temperature | 0.6 | Controls randomness (0 = greedy) |
|
| 494 |
| Top-P | 0.95 | Nucleus sampling threshold |
|
| 495 |
+
| Repetition Penalty | 1.1 | Discourages repeating recent tokens |
|
| 496 |
+
| Max Tokens | 1024 | Maximum generation length |
|
| 497 |
|
| 498 |
## QORA Model Family
|
| 499 |
|