drdraq committed · verified · Commit 6902aea · 1 Parent(s): 107991a

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +43 -15
README.md CHANGED
````diff
@@ -140,17 +140,32 @@ model/
 README.md — This file
 ```
 
+## Quick Start
+
+For the **fastest results**, use `--no-think --greedy`:
+
+```bash
+.\qora.exe --load model.qora --prompt "What is X?" --no-think --greedy
+```
+
+This skips the thinking phase and uses deterministic decoding — you get a direct answer immediately.
+
+> **Tip:** Think mode produces better answers for complex questions (math, coding, reasoning) but uses 100-300+ tokens just for thinking before the answer appears. On CPU this can take several minutes. For simple factual questions, `--no-think` is much faster.
+
 ## Usage
 
 ```bash
-# Basic chat (with thinking mode)
-qora.exe --load model.qora --prompt "What is photosynthesis?"
+# Fastest: direct answer, no thinking, deterministic
+qora.exe --load model.qora --prompt "What is the capital of France?" --no-think --greedy
+
+# Fast: direct answer with some randomness
+qora.exe --load model.qora --prompt "Tell me about Mars" --no-think
 
-# Direct answer (no thinking)
-qora.exe --load model.qora --prompt "What is the capital of France?" --no-think
+# Full quality: thinking mode (slower but better for complex questions)
+qora.exe --load model.qora --prompt "Solve: if x^2 + 3x = 10, what is x?" --max-tokens 1024
 
-# Greedy decoding (deterministic)
-qora.exe --load model.qora --prompt "Write hello world in Python" --greedy --no-think
+# See what the model is thinking
+qora.exe --load model.qora --prompt "What is 2+2?" --show-think
 
 # Control output length
 qora.exe --load model.qora --prompt "Tell me a story" --max-tokens 512
@@ -163,28 +178,40 @@ qora.exe --load model.qora --prompt "Once upon a time" --raw --max-tokens 128
 
 | Flag | Default | Description |
 |------|---------|-------------|
-| `--load <path>` | — | Load from .qora binary (fast, ~2s) |
+| `--load <path>` | — | Load from .qora binary (fast, ~2-5s) |
 | `--model-path <path>` | `.` | Path to safetensors model directory |
 | `--prompt <text>` | "Hello, how are you?" | Input prompt |
-| `--max-tokens <n>` | 256 | Maximum tokens to generate |
-| `--no-think` | off | Disable reasoning/thinking mode |
-| `--greedy` | off | Greedy decoding (temperature=0) |
+| `--max-tokens <n>` | 1024 | Maximum tokens to generate |
+| `--no-think` | off | Disable thinking mode (faster, direct answers) |
+| `--greedy` | off | Greedy decoding (temperature=0, deterministic) |
+| `--show-think` | off | Display thinking content on stderr |
 | `--raw` | off | Raw text completion (no chat template) |
 | `--f16` | off | Use F16 weights instead of Q4 |
 | `--save <path>` | — | Save model as .qora binary |
 
+### Speed Tips
+
+| Mode | Speed | Best For |
+|------|-------|----------|
+| `--no-think --greedy` | ~1 tok/s | Fastest. Simple factual questions. |
+| `--no-think` | ~1 tok/s | Fast with variety. General questions. |
+| `--show-think` | ~1 tok/s | See reasoning. Complex questions. |
+| *(default think mode)* | ~1 tok/s | Best quality but thinking uses 100-300+ tokens before the answer appears. Use `--max-tokens 1024` or higher. |
+
 ## Performance Benchmarks
 
 **Test Hardware:** Windows 11, CPU-only (no GPU acceleration)
 
 ### Inference Speed
 
+Tested on i5-11500 (6C/12T, AVX-512), 16GB RAM, Windows 11.
+
 | Metric | Value |
 |--------|-------|
-| **Model Load (binary)** | ~2-5s (single instance) |
-| **Prefill Speed** | ~0.5 tok/s (123 tokens in ~270s) |
-| **Decode Speed (warm)** | ~3.7s per token (single decode) |
-| **Decode Throughput** | 0.20-0.29 tok/s (sustained generation) |
+| **Model Load (binary)** | ~3-17s (varies with disk cache) |
+| **Prefill Speed** | ~1.3-2.2 tok/s |
+| **Decode Speed** | ~1.0 tok/s (Q4, multi-threaded GEMV) |
+| **Single Decode Step** | ~530ms (warm benchmark) |
 | **Memory (Q4)** | 1,681 MB |
 | **Memory (F16)** | ~6,000 MB |
 
@@ -465,7 +492,8 @@ QORA uses symmetric 4-bit quantization with group_size=32:
 |-----------|---------|-------------|
 | Temperature | 0.6 | Controls randomness (0 = greedy) |
 | Top-P | 0.95 | Nucleus sampling threshold |
-| Max Tokens | 256 | Maximum generation length |
+| Repetition Penalty | 1.1 | Discourages repeating recent tokens |
+| Max Tokens | 1024 | Maximum generation length |
 
 ## QORA Model Family
 
````
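The last hunk's context line notes that QORA uses symmetric 4-bit quantization with group_size=32. As a hedged illustration only — not QORA's actual code, and the signed [-7, 7] integer range and one-scale-per-group layout are assumptions — the scheme can be sketched in NumPy:

```python
import numpy as np

def quantize_q4_symmetric(weights: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit quantization sketch: each group of `group_size`
    values shares one scale; values round to signed ints in [-7, 7]."""
    flat = weights.astype(np.float32).reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # per-group scale
    scales[scales == 0] = 1.0  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reverse the mapping: scale the integers back to float32."""
    return (q.astype(np.float32) * scales).reshape(-1)

# Round-trip a small weight vector and measure worst-case error.
w = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q, s = quantize_q4_symmetric(w)
w_hat = dequantize_q4(q, s)
max_err = np.abs(w - w_hat).max()
```

Symmetric (zero-point-free) quantization like this needs only one scale per group, which is why Q4 memory (1,681 MB above) is roughly a quarter of F16 (~6,000 MB) plus a small overhead for the scales.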
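The sampling defaults in the final table (temperature 0.6, top-p 0.95, repetition penalty 1.1) compose into one decode step. The following is a generic sketch of that pipeline under standard definitions, not QORA's implementation; `sample_next_token` is a hypothetical helper name:

```python
import numpy as np

def sample_next_token(logits, recent_tokens, temperature=0.6, top_p=0.95,
                      rep_penalty=1.1, rng=None):
    """One sampling step: repetition penalty, temperature scaling,
    then top-p (nucleus) truncation. temperature=0 means greedy."""
    logits = np.array(logits, dtype=np.float64)
    # Repetition penalty: push down logits of recently emitted tokens.
    for t in set(recent_tokens):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    if temperature == 0:  # greedy / deterministic (the --greedy path)
        return int(np.argmax(logits))
    # Temperature-scaled softmax.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # Nucleus: keep the smallest prefix of tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()
    if rng is None:
        rng = np.random.default_rng(0)
    return int(keep[rng.choice(len(keep), p=p)])
```

With temperature 0 the penalty still applies, so a recently used top token can lose to a close runner-up; that is the intended effect of the 1.1 default.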