Qwen3-0.6B Summarizer

A one-sentence summarizer fine-tuned from Qwen3-0.6B using LoRA distillation. Feed it any text, get back a concise one-sentence summary.

Trained by distilling 6,720 high-quality summaries generated by Gemini 3 Flash Preview into Qwen3-0.6B. The model learns to compress markdown text (chat logs, task descriptions, bug reports, planning notes) into clear, information-dense one-liners.

Example

Input:  "Eric wants to reorganize how Cloud Eric handles project planning loops.
         Currently the planning task runs every 30 minutes and creates sub-tasks,
         but it often creates duplicates because it does not check what tasks are
         already running or what PRs are already open. The fix should add a dedup
         check that reviews pending tasks and recent GitHub PRs before creating
         anything new."

Output: "Cloud Eric planning bug fix — current task creates duplicates because
         it lacks a dedup check for pending tasks and open PRs."

375 characters in, 124 out: 67% compression while preserving the root cause and the fix.

Quick Start

With llama-cpp-python (CPU, no GPU needed)

from llama_cpp import Llama

# n_ctx=1024 leaves room for a ~2,000-character input plus 80 output tokens
llm = Llama("qwen3-0.6b-summarizer-q8_0.gguf", n_ctx=1024, n_threads=8, verbose=False)

text = "Your text to summarize here..."
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"

output = llm(prompt, max_tokens=80, temperature=0.3, stop=["\n", "<|im_end|>"])
print(output["choices"][0]["text"])

With llama.cpp CLI

./llama-cli -m qwen3-0.6b-summarizer-q8_0.gguf \
  -e -p "Summarize in one sentence:\nYour text here\n\nSummary:" \
  -n 80 --temp 0.3

With transformers (GPU)

Apply the LoRA weights to the base model:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Load and apply LoRA weights (see lora_weights/best_distill.pt)
# Or use the pre-merged GGUF files directly with llama-cpp-python

text = "Your text here"
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
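For the raw adapter in lora_weights/best_distill.pt, the merge itself is simple linear algebra; below is a minimal NumPy sketch of the arithmetic only. The checkpoint's exact key layout is not documented here, so toy matrices stand in for the real q_proj/v_proj weights:

```python
import numpy as np

def merge_lora(W, A, B, rank=16, alpha=32):
    """Fold one LoRA adapter pair into a frozen weight matrix.

    W: (out_dim, in_dim) base weight, e.g. a q_proj matrix.
    A: (rank, in_dim) down-projection; B: (out_dim, rank) up-projection.
    The update is scaled by alpha / rank, the usual LoRA convention
    (rank=16, alpha=32 per the training details below).
    """
    return W + (alpha / rank) * (B @ A)

# Toy shapes for illustration; the real matrices come from Qwen3-0.6B.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
A = rng.normal(size=(16, 32))
B = np.zeros((64, 16))  # B starts at zero in standard LoRA init, so this merge is a no-op
merged = merge_lora(W, A, B)
```

With a freshly initialized adapter (B = 0) the merge leaves W unchanged; after training, B @ A carries the learned update.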

Files

| File | Size | Description |
|---|---|---|
| qwen3-0.6b-summarizer-q8_0.gguf | 610 MB | Recommended. Q8_0 quantized, best speed/quality tradeoff. |
| qwen3-0.6b-summarizer-f16.gguf | 1.1 GB | Full F16 precision. Slightly better quality, slower. |
| lora_weights/best_distill.pt | 8.8 MB | Raw LoRA weights (PyTorch). Apply to base Qwen3-0.6B. |
| training/training_metrics.json | 789 KB | Full training metrics (per-step loss, LR, grad norms). |
| training/training_charts.png | 157 KB | Training loss curves visualization. |
| training/gpu_distill.py | 18 KB | Training script (for reproduction). |
| training/merge_and_export.py | 10 KB | LoRA merge + GGUF export script. |

Performance

| Metric | Value |
|---|---|
| Inference speed (CPU, 8 threads) | 3-5 seconds per summary (~7 tok/s) |
| Inference speed (GPU) | <0.5 seconds per summary |
| Model load time | 0.6 s (Q8_0) |
| Average output length | ~30 tokens |
| Max recommended input | ~2,000 characters |
| Validation loss | 1.136 |

Training Details

Method

LoRA distillation from Gemini 3 Flash Preview outputs. The base Qwen3-0.6B is frozen; only LoRA adapters (rank=16, alpha=32) on q_proj and v_proj across all 28 attention layers are trained.

  • Training data: 6,720 (text, summary) pairs. Text is markdown content from a personal knowledge management system (chat logs, task descriptions, project notes, bug reports). Summaries were generated by Gemini 3 Flash Preview.
  • Split: 6,048 train / 672 validation
  • Optimizer: AdamW, lr=2e-4, weight_decay=0.01, cosine schedule
  • Mixed precision: float16 with GradScaler
  • Batch size: 8
  • Epochs: 5 (best at epoch 3)
  • Hardware: NVIDIA L4 (24GB) on RunPod
  • Training time: 31 minutes
  • Training cost: ~$0.20
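The cosine schedule above decays the learning rate from 2e-4 toward zero over the run. A minimal sketch of that decay curve, assuming no warmup (the card does not mention one) and using the run's numbers (6,048 examples / batch size 8 = 756 steps per epoch, 3,780 steps over 5 epochs):

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=0.0):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at the end."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 3780  # 756 steps/epoch * 5 epochs
start = cosine_lr(0, total)       # 2e-4
halfway = cosine_lr(1890, total)  # 1e-4
final = cosine_lr(total, total)   # 0.0
```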

Prompt Format

The model was trained with this exact prompt template:

Summarize in one sentence:
{text}

Summary:

Use this format for best results. The model outputs a single sentence and stops.
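A tiny helper (illustrative, not part of the release) that applies the template and enforces the ~2,000-character input cap from the performance table:

```python
def build_prompt(text: str, max_chars: int = 2000) -> str:
    """Format input with the exact template the model was trained on,
    truncating to the recommended maximum input length."""
    return f"Summarize in one sentence:\n{text[:max_chars]}\n\nSummary:"

prompt = build_prompt("The deploy failed.")
```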

Training Curves

(Loss curves: see training/training_charts.png.)

Best validation loss of 1.136 at epoch 3. Mild overfitting begins at epoch 4.

Why Distillation?

We tried several approaches before settling on distillation:

| Approach | Val Loss | Quality |
|---|---|---|
| Prefix tuning (embedding → soft tokens) | 1.15-1.24 | Hallucinated entity names |
| LoRA + embedding projection | 1.14-1.16 | Better, but still imprecise on details |
| Text distillation (this model) | 1.14 | Near-verbatim reproduction |

The key insight: embedding vectors don't encode specific details (PR numbers, app names, exact error messages). By training directly on raw text, the model can see and reproduce those details. The distillation approach produces summaries that are nearly indistinguishable from the Gemini originals.

Sample Generations

From the validation set (unseen during training):

| Reference (Gemini) | Generated (This Model) |
|---|---|
| Batch deployment — merged seven branches into clouderic including FolkReel query fixes, desktop/mobi... | Batch deployment — merged seven branches for clouderic including folkreel query field fixes, desktop... |
| FolkReel planning cycle — reviewed project state and strategic focus on AI-led interview elicitation... | FolkReel planning cycle — reviewed project state to prioritize interview elicitation and the iterati... |
| Bug fix for Claude configuration error — adding symlink-on-startup logic to the Go binary to ensure... | Claude configuration fix — adding symlink-on-startup logic to Go binary and running tests to resolv... |
| WebUI Next planning cycle — enforcing a tight 10-minute loop to address subpar product quality... | WebUI Next planning cycle — enforcing a short cadence to address poor visual design and interaction... |

Limitations

  • Domain-specific: Trained on software engineering/devops content. Will work on general text but style is tuned for technical summaries.
  • Single sentence: Always outputs one sentence. Not suitable for multi-paragraph summarization.
  • English only: Trained exclusively on English text.
  • Max input ~2K chars: Longer texts get truncated. For very long documents, consider chunking.
  • No thinking/reasoning: This is a distilled model; it pattern-matches rather than reasons about content.
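For inputs past the ~2K-character limit, one workable pattern is map-reduce chunking: summarize each chunk, then summarize the joined summaries. A sketch, where the summarize callable stands in for any of the inference paths above and the helper names are hypothetical:

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs up to max_chars.
    A single paragraph longer than max_chars passes through uncut."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def summarize_long(text: str, summarize) -> str:
    """Map-reduce: summarize each chunk, then summarize the joined one-liners."""
    parts = [summarize(c) for c in chunk_text(text)]
    # One reduce pass usually suffices; recurse if the joined summaries are still long.
    return summarize(" ".join(parts)) if len(parts) > 1 else parts[0]
```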

License

Apache 2.0 (same as the base Qwen3-0.6B model).

Acknowledgments

  • Qwen3-0.6B by Alibaba Cloud - the base model
  • Gemini 3 Flash Preview by Google - generated the training summaries
  • llama.cpp - GGUF format and CPU inference
  • RunPod - GPU training infrastructure
  • Built as part of the Cloud Eric project