Qwen3-0.6B Summarizer

A one-sentence summarizer fine-tuned from Qwen3-0.6B using LoRA distillation. Feed it any text, get back a concise one-sentence summary.

Trained by distilling 6,720 high-quality summaries generated by Gemini 3 Flash Preview into Qwen3-0.6B. The model learns to compress markdown text (chat logs, task descriptions, bug reports, planning notes) into clear, information-dense one-liners.

Example

Input:  "Eric wants to reorganize how Cloud Eric handles project planning loops.
         Currently the planning task runs every 30 minutes and creates sub-tasks,
         but it often creates duplicates because it does not check what tasks are
         already running or what PRs are already open. The fix should add a dedup
         check that reviews pending tasks and recent GitHub PRs before creating
         anything new."

Output: "Cloud Eric planning bug fix — current task creates duplicates because
         it lacks a dedup check for pending tasks and open PRs."

375 characters in, 124 out: 67% compression while preserving the root cause and the fix.

Quick Start

With llama-cpp-python (CPU, no GPU needed)

from llama_cpp import Llama

# n_ctx=1024 leaves room for a ~2,000-character input plus 80 output tokens
llm = Llama("qwen3-0.6b-summarizer-q8_0.gguf", n_ctx=1024, n_threads=8, verbose=False)

text = "Your text to summarize here..."
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"

output = llm(prompt, max_tokens=80, temperature=0.3, stop=["\n", "<|im_end|>"])
print(output["choices"][0]["text"])

With llama.cpp CLI

./llama-cli -m qwen3-0.6b-summarizer-q8_0.gguf \
  -e -p "Summarize in one sentence:\nYour text here\n\nSummary:" \
  -n 80 --temp 0.3

With transformers (GPU)

Apply the LoRA weights to the base model:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Load and apply LoRA weights (see lora_weights/best_distill.pt)
# Or use the pre-merged GGUF files directly with llama-cpp-python

text = "Your text here"
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
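For the raw adapter in lora_weights/best_distill.pt, the merge itself is simple linear algebra; below is a minimal NumPy sketch of the arithmetic only. The checkpoint's exact key layout is not documented here, so toy matrices stand in for the real q_proj/v_proj weights:

```python
import numpy as np

def merge_lora(W, A, B, rank=16, alpha=32):
    """Fold one LoRA adapter pair into a frozen weight matrix.

    W: (out_dim, in_dim) base weight, e.g. a q_proj matrix.
    A: (rank, in_dim) down-projection; B: (out_dim, rank) up-projection.
    The update is scaled by alpha / rank, the usual LoRA convention
    (rank=16, alpha=32 per the training details below).
    """
    return W + (alpha / rank) * (B @ A)

# Toy shapes for illustration; the real matrices come from Qwen3-0.6B.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
A = rng.normal(size=(16, 32))
B = np.zeros((64, 16))  # B starts at zero in standard LoRA init, so this merge is a no-op
merged = merge_lora(W, A, B)
```

With a freshly initialized adapter (B = 0) the merge leaves W unchanged; after training, B @ A carries the learned update.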

Files

| File | Size | Description |
|---|---|---|
| qwen3-0.6b-summarizer-q8_0.gguf | 610 MB | Recommended. Q8_0 quantized, best speed/quality tradeoff. |
| qwen3-0.6b-summarizer-f16.gguf | 1.1 GB | Full F16 precision. Slightly better quality, slower. |
| lora_weights/best_distill.pt | 8.8 MB | Raw LoRA weights (PyTorch). Apply to base Qwen3-0.6B. |
| training/training_metrics.json | 789 KB | Full training metrics (per-step loss, LR, grad norms). |
| training/training_charts.png | 157 KB | Training loss curves visualization. |
| training/gpu_distill.py | 18 KB | Training script (for reproduction). |
| training/merge_and_export.py | 10 KB | LoRA merge + GGUF export script. |

Performance

| Metric | Value |
|---|---|
| Inference speed (CPU, 8 threads) | 3-5 seconds per summary (~7 tok/s) |
| Inference speed (GPU) | <0.5 seconds per summary |
| Model load time | 0.6 s (Q8_0) |
| Average output length | ~30 tokens |
| Max recommended input | ~2,000 characters |
| Validation loss | 1.136 |

Training Details

Method

LoRA distillation from Gemini 3 Flash Preview outputs. The base Qwen3-0.6B is frozen; only LoRA adapters (rank=16, alpha=32) on q_proj and v_proj across all 28 attention layers are trained.

  • Training data: 6,720 (text, summary) pairs. Text is markdown content from a personal knowledge management system (chat logs, task descriptions, project notes, bug reports). Summaries were generated by Gemini 3 Flash Preview.
  • Split: 6,048 train / 672 validation
  • Optimizer: AdamW, lr=2e-4, weight_decay=0.01, cosine schedule
  • Mixed precision: float16 with GradScaler
  • Batch size: 8
  • Epochs: 5 (best at epoch 3)
  • Hardware: NVIDIA L4 (24GB) on RunPod
  • Training time: 31 minutes
  • Training cost: ~$0.20
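The cosine schedule above decays the learning rate from 2e-4 toward zero over the run. A minimal sketch of that decay curve, assuming no warmup (the card does not mention one) and using the run's numbers (6,048 examples / batch size 8 = 756 steps per epoch, 3,780 steps over 5 epochs):

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=0.0):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at the end."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 3780  # 756 steps/epoch * 5 epochs
start = cosine_lr(0, total)       # 2e-4
halfway = cosine_lr(1890, total)  # 1e-4
final = cosine_lr(total, total)   # 0.0
```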

Prompt Format

The model was trained with this exact prompt template:

Summarize in one sentence:
{text}

Summary:

Use this format for best results. The model outputs a single sentence and stops.
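A tiny helper (illustrative, not part of the release) that applies the template and enforces the ~2,000-character input cap from the performance table:

```python
def build_prompt(text: str, max_chars: int = 2000) -> str:
    """Format input with the exact template the model was trained on,
    truncating to the recommended maximum input length."""
    return f"Summarize in one sentence:\n{text[:max_chars]}\n\nSummary:"

prompt = build_prompt("The deploy failed.")
```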

Training Curves

(Loss curves: see training/training_charts.png.)

Best validation loss of 1.136 at epoch 3. Mild overfitting begins at epoch 4.

Why Distillation?

We tried several approaches before settling on distillation:

| Approach | Val Loss | Quality |
|---|---|---|
| Prefix tuning (embedding → soft tokens) | 1.15-1.24 | Hallucinated entity names |
| LoRA + embedding projection | 1.14-1.16 | Better, but still imprecise on details |
| Text distillation (this model) | 1.14 | Near-verbatim reproduction |

The key insight: embedding vectors don't encode specific details (PR numbers, app names, exact error messages). By training directly on raw text, the model can see and reproduce those details. The distillation approach produces summaries that are nearly indistinguishable from the Gemini originals.

Sample Generations

From the validation set (unseen during training):

| Reference (Gemini) | Generated (This Model) |
|---|---|
| Batch deployment — merged seven branches into clouderic including FolkReel query fixes, desktop/mobi... | Batch deployment — merged seven branches for clouderic including folkreel query field fixes, desktop... |
| FolkReel planning cycle — reviewed project state and strategic focus on AI-led interview elicitation... | FolkReel planning cycle — reviewed project state to prioritize interview elicitation and the iterati... |
| Bug fix for Claude configuration error — adding symlink-on-startup logic to the Go binary to ensure... | Claude configuration fix — adding symlink-on-startup logic to Go binary and running tests to resolv... |
| WebUI Next planning cycle — enforcing a tight 10-minute loop to address subpar product quality... | WebUI Next planning cycle — enforcing a short cadence to address poor visual design and interaction... |

Limitations

  • Domain-specific: Trained on software engineering/devops content. Will work on general text but style is tuned for technical summaries.
  • Single sentence: Always outputs one sentence. Not suitable for multi-paragraph summarization.
  • English only: Trained exclusively on English text.
  • Max input ~2K chars: Longer texts get truncated. For very long documents, consider chunking.
  • No thinking/reasoning: This is a distilled model; it pattern-matches rather than reasons about content.
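For inputs past the ~2K-character limit, one workable pattern is map-reduce chunking: summarize each chunk, then summarize the joined summaries. A sketch, where the summarize callable stands in for any of the inference paths above and the helper names are hypothetical:

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs up to max_chars.
    A single paragraph longer than max_chars passes through uncut."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def summarize_long(text: str, summarize) -> str:
    """Map-reduce: summarize each chunk, then summarize the joined one-liners."""
    parts = [summarize(c) for c in chunk_text(text)]
    # One reduce pass usually suffices; recurse if the joined summaries are still long.
    return summarize(" ".join(parts)) if len(parts) > 1 else parts[0]
```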

License

Apache 2.0 (same as the base Qwen3-0.6B model).

Acknowledgments

  • Qwen3-0.6B by Alibaba Cloud - the base model
  • Gemini 3 Flash Preview by Google - generated the training summaries
  • llama.cpp - GGUF format and CPU inference
  • RunPod - GPU training infrastructure
  • Built as part of the Cloud Eric project