# Qwen3-0.6B Summarizer
A one-sentence summarizer fine-tuned from Qwen3-0.6B using LoRA distillation. Feed it any text, get back a concise one-sentence summary.
Trained by distilling 6,720 high-quality summaries generated by Gemini Flash into Qwen3-0.6B. The model learns to compress markdown text (chat logs, task descriptions, bug reports, planning notes) into clear, information-dense one-liners.
## Example
Input: "Eric wants to reorganize how Cloud Eric handles project planning loops.
Currently the planning task runs every 30 minutes and creates sub-tasks,
but it often creates duplicates because it does not check what tasks are
already running or what PRs are already open. The fix should add a dedup
check that reviews pending tasks and recent GitHub PRs before creating
anything new."
Output: "Cloud Eric planning bug fix – current task creates duplicates because
it lacks a dedup check for pending tasks and open PRs."

375 characters in, 124 out – 67% compression while preserving the root cause and fix.
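The compression figure is simple character arithmetic:

```python
# Character counts from the example above.
chars_in, chars_out = 375, 124
compression = 1 - chars_out / chars_in
print(f"{compression:.0%} compression")  # prints "67% compression"
```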
## Quick Start

### With llama-cpp-python (CPU, no GPU needed)
```python
from llama_cpp import Llama

llm = Llama("qwen3-0.6b-summarizer-q8_0.gguf", n_ctx=512, n_threads=8, verbose=False)

text = "Your text to summarize here..."
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"
output = llm(prompt, max_tokens=80, temperature=0.3, stop=["\n", "<|im_end|>"])
print(output["choices"][0]["text"])
```
### With llama.cpp CLI

```sh
# -e expands the \n escape sequences in the prompt string
./llama-cli -m qwen3-0.6b-summarizer-q8_0.gguf \
  -e -p "Summarize in one sentence:\nYour text here\n\nSummary:" \
  -n 80 --temp 0.3
```
### With transformers (GPU)
Apply the LoRA weights to the base model:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the base model onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Load and apply the LoRA weights (see lora_weights/best_distill.pt),
# or use the pre-merged GGUF files directly with llama-cpp-python.

text = "Your text here"
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
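For reference, merging a LoRA pair into a frozen base weight follows W' = W + (alpha/r) * B A. A minimal sketch with synthetic tensors (the exact on-disk layout of lora_weights/best_distill.pt is not documented here, so shapes and names below are illustrative):

```python
import torch

# LoRA merge: W' = W + (alpha / r) * B @ A, with r=16, alpha=32 as trained.
d_out, d_in, rank, alpha = 1024, 1024, 16, 32

W = torch.randn(d_out, d_in)        # frozen base projection (e.g. q_proj)
A = torch.randn(rank, d_in) * 0.01  # LoRA down-projection
B = torch.zeros(d_out, rank)        # LoRA up-projection (zero at init)

scaling = alpha / rank              # 32 / 16 = 2.0
W_merged = W + scaling * (B @ A)
assert W_merged.shape == W.shape
```

This is what training/merge_and_export.py has to do for every adapted projection before the merged model can be exported to GGUF.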
## Files
| File | Size | Description |
|---|---|---|
| `qwen3-0.6b-summarizer-q8_0.gguf` | 610 MB | Recommended. Q8_0 quantized, best speed/quality tradeoff. |
| `qwen3-0.6b-summarizer-f16.gguf` | 1.1 GB | Full F16 precision. Slightly better quality, slower. |
| `lora_weights/best_distill.pt` | 8.8 MB | Raw LoRA weights (PyTorch). Apply to the base Qwen3-0.6B. |
| `training/training_metrics.json` | 789 KB | Full training metrics (per-step loss, LR, grad norms). |
| `training/training_charts.png` | 157 KB | Training loss curves visualization. |
| `training/gpu_distill.py` | 18 KB | Training script (for reproduction). |
| `training/merge_and_export.py` | 10 KB | LoRA merge + GGUF export script. |
## Performance
| Metric | Value |
|---|---|
| Inference speed (CPU, 8 threads) | 3-5 seconds per summary (~7 tok/s) |
| Inference speed (GPU) | <0.5 seconds per summary |
| Model load time | 0.6s (Q8_0) |
| Average output length | ~30 tokens |
| Max recommended input | ~2,000 characters |
| Validation loss | 1.136 |
## Training Details

### Method
LoRA distillation from Gemini 3 Flash Preview outputs. The base Qwen3-0.6B is frozen; only LoRA adapters (rank=16, alpha=32) on q_proj and v_proj across all 28 attention layers are trained.
- Training data: 6,720 (text, summary) pairs. Text is markdown content from a personal knowledge management system (chat logs, task descriptions, project notes, bug reports). Summaries were generated by Gemini 3 Flash Preview.
- Split: 6,048 train / 672 validation
- Optimizer: AdamW, lr=2e-4, weight_decay=0.01, cosine schedule
- Mixed precision: float16 with GradScaler
- Batch size: 8
- Epochs: 5 (best at epoch 3)
- Hardware: NVIDIA L4 (24GB) on RunPod
- Training time: 31 minutes
- Training cost: ~$0.20
### Prompt Format

The model was trained with this exact prompt template (note the blank line before `Summary:`):

```
Summarize in one sentence:
{text}

Summary:
```

Use this format for best results. The model outputs a single sentence and stops.
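A small helper that enforces the template (the truncation limit mirrors the ~2,000-character recommendation above; the function name is ours):

```python
def build_prompt(text: str, max_chars: int = 2000) -> str:
    """Wrap text in the exact template used during training,
    truncating to the recommended input limit."""
    return f"Summarize in one sentence:\n{text[:max_chars]}\n\nSummary:"

prompt = build_prompt("The deploy failed because the config was stale.")
```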
### Training Curves
Best validation loss of 1.136 at epoch 3. Mild overfitting begins at epoch 4.
## Why Distillation?

We tried multiple approaches before settling on distillation:
| Approach | Val Loss | Quality |
|---|---|---|
| Prefix tuning (embedding → soft tokens) | 1.15-1.24 | Hallucinated entity names |
| LoRA + embedding projection | 1.14-1.16 | Better but still imprecise on details |
| Text distillation (this model) | 1.14 | Near-verbatim reproduction |
The key insight: embedding vectors don't encode specific details (PR numbers, app names, exact error messages). By training directly on raw text, the model can see and reproduce those details. The distillation approach produces summaries that are nearly indistinguishable from the Gemini originals.
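Mechanically, text distillation here is ordinary supervised fine-tuning on the teacher's outputs: next-token cross-entropy over (prompt + Gemini summary), with prompt positions masked out of the loss. A sketch with made-up token IDs:

```python
import torch

# Build labels for distillation: only the teacher-summary tokens
# contribute to the cross-entropy; prompt positions are masked to -100.
prompt_ids = torch.tensor([12, 47, 3, 88])   # tokenized prompt (IDs are fake)
summary_ids = torch.tensor([5, 19, 7])       # tokenized Gemini summary

input_ids = torch.cat([prompt_ids, summary_ids]).unsqueeze(0)
labels = input_ids.clone()
labels[0, : prompt_ids.numel()] = -100       # ignore prompt positions in the loss

# During training: loss = model(input_ids, labels=labels).loss
# (HF causal-LM models shift labels internally for next-token prediction.)
```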
## Sample Generations
From the validation set (unseen during training):
| Reference (Gemini) | Generated (This Model) |
|---|---|
| Batch deployment – merged seven branches into clouderic including FolkReel query fixes, desktop/mobi... | Batch deployment – merged seven branches for clouderic including folkreel query field fixes, desktop... |
| FolkReel planning cycle – reviewed project state and strategic focus on AI-led interview elicitation... | FolkReel planning cycle – reviewed project state to prioritize interview elicitation and the iterati... |
| Bug fix for Claude configuration error – adding symlink-on-startup logic to the Go binary to ensure... | Claude configuration fix – adding symlink-on-startup logic to Go binary and running tests to resolv... |
| WebUI Next planning cycle – enforcing a tight 10-minute loop to address subpar product quality... | WebUI Next planning cycle – enforcing a short cadence to address poor visual design and interaction... |
## Limitations
- Domain-specific: Trained on software engineering/devops content. Will work on general text but style is tuned for technical summaries.
- Single sentence: Always outputs one sentence. Not suitable for multi-paragraph summarization.
- English only: Trained exclusively on English text.
- Max input ~2K chars: Longer texts get truncated. For very long documents, consider chunking.
- No thinking/reasoning: This is a distilled model – it pattern-matches rather than reasons about content.
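For inputs beyond the ~2K-character limit, a simple map-reduce loop works: chunk on paragraph boundaries, summarize each chunk, then summarize the concatenation of the partial summaries. A sketch (`summarize()` stands in for a model call using the prompt template above; paragraphs longer than the limit are kept whole here):

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks under the model's ~2K character limit,
    breaking on paragraph boundaries where possible."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Map-reduce summarization over the chunks:
# partials = [summarize(c) for c in chunk_text(long_doc)]
# final = summarize(" ".join(partials))
```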
## License
Apache 2.0 (same as the base Qwen3-0.6B model).
## Acknowledgments
- Qwen3-0.6B by Alibaba Cloud – the base model
- Gemini 3 Flash Preview by Google – generated the training summaries
- llama.cpp – GGUF format and CPU inference
- RunPod – GPU training infrastructure
- Built as part of the Cloud Eric project