drdraq committed · verified · Commit 6902aea · 1 Parent(s): 107991a

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +43 -15
README.md CHANGED
````diff
@@ -140,17 +140,32 @@ model/
 README.md — This file
 ```
 
+## Quick Start
+
+For the **fastest results**, use `--no-think --greedy`:
+
+```bash
+.\qora.exe --load model.qora --prompt "What is X?" --no-think --greedy
+```
+
+This skips the thinking phase and uses deterministic decoding — you get a direct answer immediately.
+
+> **Tip:** Think mode produces better answers for complex questions (math, coding, reasoning) but uses 100-300+ tokens just for thinking before the answer appears. On CPU this can take several minutes. For simple factual questions, `--no-think` is much faster.
+
 ## Usage
 
 ```bash
-# Basic chat (with thinking mode)
-qora.exe --load model.qora --prompt "What is photosynthesis?"
+# Fastest: direct answer, no thinking, deterministic
+qora.exe --load model.qora --prompt "What is the capital of France?" --no-think --greedy
+
+# Fast: direct answer with some randomness
+qora.exe --load model.qora --prompt "Tell me about Mars" --no-think
 
-# Direct answer (no thinking)
-qora.exe --load model.qora --prompt "What is the capital of France?" --no-think
+# Full quality: thinking mode (slower but better for complex questions)
+qora.exe --load model.qora --prompt "Solve: if x^2 + 3x = 10, what is x?" --max-tokens 1024
 
-# Greedy decoding (deterministic)
-qora.exe --load model.qora --prompt "Write hello world in Python" --greedy --no-think
+# See what the model is thinking
+qora.exe --load model.qora --prompt "What is 2+2?" --show-think
 
 # Control output length
 qora.exe --load model.qora --prompt "Tell me a story" --max-tokens 512
@@ -163,28 +178,40 @@ qora.exe --load model.qora --prompt "Once upon a time" --raw --max-tokens 128
 
 | Flag | Default | Description |
 |------|---------|-------------|
-| `--load <path>` | — | Load from .qora binary (fast, ~2s) |
+| `--load <path>` | — | Load from .qora binary (fast, ~2-5s) |
 | `--model-path <path>` | `.` | Path to safetensors model directory |
 | `--prompt <text>` | "Hello, how are you?" | Input prompt |
-| `--max-tokens <n>` | 256 | Maximum tokens to generate |
-| `--no-think` | off | Disable reasoning/thinking mode |
-| `--greedy` | off | Greedy decoding (temperature=0) |
+| `--max-tokens <n>` | 1024 | Maximum tokens to generate |
+| `--no-think` | off | Disable thinking mode (faster, direct answers) |
+| `--greedy` | off | Greedy decoding (temperature=0, deterministic) |
+| `--show-think` | off | Display thinking content on stderr |
 | `--raw` | off | Raw text completion (no chat template) |
 | `--f16` | off | Use F16 weights instead of Q4 |
 | `--save <path>` | — | Save model as .qora binary |
 
+### Speed Tips
+
+| Mode | Speed | Best For |
+|------|-------|----------|
+| `--no-think --greedy` | ~1 tok/s | Fastest. Simple factual questions. |
+| `--no-think` | ~1 tok/s | Fast with variety. General questions. |
+| `--show-think` | ~1 tok/s | See reasoning. Complex questions. |
+| *(default think mode)* | ~1 tok/s | Best quality but thinking uses 100-300+ tokens before the answer appears. Use `--max-tokens 1024` or higher. |
+
 ## Performance Benchmarks
 
 **Test Hardware:** Windows 11, CPU-only (no GPU acceleration)
 
 ### Inference Speed
 
+Tested on i5-11500 (6C/12T, AVX-512), 16GB RAM, Windows 11.
+
 | Metric | Value |
 |--------|-------|
-| **Model Load (binary)** | ~2-5s (single instance) |
-| **Prefill Speed** | ~0.5 tok/s (123 tokens in ~270s) |
-| **Decode Speed (warm)** | ~3.7s per token (single decode) |
-| **Decode Throughput** | 0.20-0.29 tok/s (sustained generation) |
+| **Model Load (binary)** | ~3-17s (varies with disk cache) |
+| **Prefill Speed** | ~1.3-2.2 tok/s |
+| **Decode Speed** | ~1.0 tok/s (Q4, multi-threaded GEMV) |
+| **Single Decode Step** | ~530ms (warm benchmark) |
 | **Memory (Q4)** | 1,681 MB |
 | **Memory (F16)** | ~6,000 MB |
 
@@ -465,7 +492,8 @@ QORA uses symmetric 4-bit quantization with group_size=32:
 |-----------|---------|-------------|
 | Temperature | 0.6 | Controls randomness (0 = greedy) |
 | Top-P | 0.95 | Nucleus sampling threshold |
-| Max Tokens | 256 | Maximum generation length |
+| Repetition Penalty | 1.1 | Discourages repeating recent tokens |
+| Max Tokens | 1024 | Maximum generation length |
 
 ## QORA Model Family
 
````
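The last hunk's context line notes that QORA uses symmetric 4-bit quantization with group_size=32. As a hedged illustration only — not QORA's actual code, and the signed [-7, 7] integer range and one-scale-per-group layout are assumptions — the scheme can be sketched in NumPy:

```python
import numpy as np

def quantize_q4_symmetric(weights: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit quantization sketch: each group of `group_size`
    values shares one scale; values round to signed ints in [-7, 7]."""
    flat = weights.astype(np.float32).reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # per-group scale
    scales[scales == 0] = 1.0  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reverse the mapping: scale the integers back to float32."""
    return (q.astype(np.float32) * scales).reshape(-1)

# Round-trip a small weight vector and measure worst-case error.
w = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q, s = quantize_q4_symmetric(w)
w_hat = dequantize_q4(q, s)
max_err = np.abs(w - w_hat).max()
```

Symmetric (zero-point-free) quantization like this needs only one scale per group, which is why Q4 memory (1,681 MB above) is roughly a quarter of F16 (~6,000 MB) plus a small overhead for the scales.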
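The sampling defaults in the final table (temperature 0.6, top-p 0.95, repetition penalty 1.1) compose into one decode step. The following is a generic sketch of that pipeline under standard definitions, not QORA's implementation; `sample_next_token` is a hypothetical helper name:

```python
import numpy as np

def sample_next_token(logits, recent_tokens, temperature=0.6, top_p=0.95,
                      rep_penalty=1.1, rng=None):
    """One sampling step: repetition penalty, temperature scaling,
    then top-p (nucleus) truncation. temperature=0 means greedy."""
    logits = np.array(logits, dtype=np.float64)
    # Repetition penalty: push down logits of recently emitted tokens.
    for t in set(recent_tokens):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    if temperature == 0:  # greedy / deterministic (the --greedy path)
        return int(np.argmax(logits))
    # Temperature-scaled softmax.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # Nucleus: keep the smallest prefix of tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()
    if rng is None:
        rng = np.random.default_rng(0)
    return int(keep[rng.choice(len(keep), p=p)])
```

With temperature 0 the penalty still applies, so a recently used top token can lose to a close runner-up; that is the intended effect of the 1.1 default.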