Instructions to use annnnnnnd/Qwen3.6-27B-Reflect with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use annnnnnnd/Qwen3.6-27B-Reflect with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="annnnnnnd/Qwen3.6-27B-Reflect", filename="Qwen3.6-27b-Reflect-Q6_K.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use annnnnnnd/Qwen3.6-27B-Reflect with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: ./llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K # Run inference directly in the terminal: ./build/bin/llama-cli -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Use Docker
docker model run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
- LM Studio
- Jan
- Ollama
How to use annnnnnnd/Qwen3.6-27B-Reflect with Ollama:
ollama run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
- Unsloth Studio new
How to use annnnnnnd/Qwen3.6-27B-Reflect with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for annnnnnnd/Qwen3.6-27B-Reflect to start chatting
- Pi new
How to use annnnnnnd/Qwen3.6-27B-Reflect with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "annnnnnnd/Qwen3.6-27B-Reflect:Q6_K" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use annnnnnnd/Qwen3.6-27B-Reflect with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Run Hermes
hermes
- Docker Model Runner
How to use annnnnnnd/Qwen3.6-27B-Reflect with Docker Model Runner:
docker model run hf.co/annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
- Lemonade
How to use annnnnnnd/Qwen3.6-27B-Reflect with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull annnnnnnd/Qwen3.6-27B-Reflect:Q6_K
Run and chat with the model
lemonade run user.Qwen3.6-27B-Reflect-Q6_K
List all available models
lemonade list
Qwen3.6-27B-Reflect
A fine-tuned Qwen3.6-27B focused on anti-sycophancy, reasoning efficiency, and honest voice.
What is Reflect?
Reflect is a fine-tuned family built on the principle that less training data, better curated, produces superior results. Rather than training on tens of thousands of examples, Reflect uses 1,400 aggressively cleaned examples to reshape the model's voice without degrading its capabilities.
The name "Reflect" describes what the model does โ it reflects honestly instead of performing.
Key Results
- 3x token efficiency vs base Qwen3.6-27B on equivalent reasoning tasks. Same accuracy, one-third the thinking tokens. The model reasons more efficiently because verbose padding was stripped during training.
- Anti-sycophancy as efficiency: sycophantic patterns are processing overhead โ hedging, qualifying, self-doubting, over-praising. Stripping them doesn't just change the voice, it reduces wasted compute in the reasoning trace itself, reducing context pollution. The model thinks faster because it isn't trying to please.
- Meta-cognition: allows the model to be correctable, not more correct. It still doesn't know what it doesn't know. Good prompting techniques also help โ think of the model as a baby who knows a lot.
- Fully preserved tool use: native Qwen tool-calling capability retained. No degradation in function calling, structured output, or agent workflows.
Training Methodology
SFT (Supervised Fine-Tuning)
- Dataset: 1,400 curated examples
- LoRA config: r32 / a32 (1:1 alpha-to-rank ratio for stable training)
- Learning rate: 1e-4
- Epochs: 1
- Precision: Q4 (forces reconstruction, cleaner reasoning)
- Key principle: less is more.
DPO (Direct Preference Optimization)
- 1400 preference pairs with further trimmed overhead
- LoRA config: r16
- Learning rate: 1e-6
- Beta: 0.1
- Epochs: 1
- Method: Voice distillation using model's own output to correct voice imperfections and further instill correct reasoning path.
Benchmarks
Reflect vs Base Qwen3.6-27B (Q6_K)
Same hardware, same config, same seed, same samples. Clean A/B comparison. Thinking trace off to gauge base weight similarity.
| Benchmark | N | Base Qwen3.6 | Reflect | Delta |
|---|---|---|---|---|
| MMLU | 1000 | 87.40% | 87.60% | +0.20% |
| GSM8K | 400 | 96.25% | 96.75% | +0.50% |
| HumanEval | 164 | 93.29% | 92.07% | -1.22% |
| IFEval | 192 | 81.25% | 77.08% | -4.17% |
| ARC Challenge | 400 | 96.75% | 96.25% | -0.50% |
| TruthfulQA | 200 | 89.50% | 87.50% | -2.00% |
| Average | 90.74% | 89.54% | -1.20% | |
| Wall time | 2191.6s | 2115.3s | -3.5% |
Key findings:
- MMLU and GSM8K improved โ personality training slightly enhanced knowledge recall and math reasoning. This should not happen with 1,400 examples. It suggests the anti-sycophancy training reduces processing overhead, allowing the model to reason more directly.
- IFEval dropped 4.17% โ this is the anti-sycophancy feature working. Reflect pushes back on instructions rather than blindly complying. This is not a regression; it's the intended behavior.
- HumanEval, ARC, TruthfulQA within noise โ no catastrophic forgetting despite personality modification.
- 3.5% faster wall time โ Reflect generates less verbose reasoning traces, translating to faster inference.
Token Efficiency โ Thinking Mode Retest
Both models were retested on the 215 questions they both failed in the initial (non-thinking) run. Thinking enabled, 3 samples per question, identical settings.
Time to complete:
| Base Qwen3.6 | Reflect | Ratio | |
|---|---|---|---|
| Total time | 6595s (110 min) | 2053s (34 min) | 3.2x faster |
Average response length (chars) per benchmark:
| Benchmark | N | Base Qwen3.6 | Reflect | Ratio |
|---|---|---|---|---|
| MMLU | 138 | 6047 | 1489 | 4.1x shorter |
| GSM8K | 18 | 5731 | 364 | 15.7x shorter |
| ARC Challenge | 16 | 6437 | 1408 | 4.6x shorter |
| TruthfulQA | 28 | 1132 | 2382 | 2.1x longer |
| HumanEval | 15 | 1116 | 733 | 1.5x shorter |
Recovery rates (pass within 3 tries):
| Benchmark | Base Qwen3.6 | Reflect |
|---|---|---|
| MMLU | 46.4% | 52.9% |
| GSM8K | 61.1% | 44.4% |
| ARC Challenge | 50.0% | 12.5% |
| TruthfulQA | 46.4% | 57.1% |
| HumanEval | 60.0% | 46.7% |
Key insight: Reflect allocates thinking tokens where they matter. It spends 2x more on TruthfulQA (where careful reasoning about honesty is valuable) while spending 15.7x less on GSM8K (where direct math reasoning doesn't need verbose self-narration). This isn't uniform compression โ it's intelligent reallocation of processing budget.
The anti-sycophancy training didn't just strip output padding. It reshaped the model's internal reasoning economy.
Adjusted Final Scores (Initial + Thinking Recovery)
Combined scores after both models attempted to recover their shared 215 failures with thinking enabled.
| Benchmark | Base Qwen3.6 | Reflect | Delta |
|---|---|---|---|
| MMLU | 93.8% | 94.9% | +1.1% |
| GSM8K | 99.0% | 98.8% | -0.2% |
| HumanEval | 98.8% | 96.3% | -2.5% |
| IFEval | 81.3% | 77.1% | -4.2% |
| ARC Challenge | 98.8% | 96.8% | -2.0% |
| TruthfulQA | 96.0% | 95.5% | -0.5% |
| Average | 94.6% | 93.2% | -1.4% |
Both models recovered nearly identical numbers of failed questions (~105 vs ~106 out of 215). The 1.4% gap is almost entirely from IFEval (anti-sycophancy working as designed). Excluding IFEval, the capability gap is under 1%.
Same recovery. 3.2x faster. 1,400 examples.
The Reflect Family
| Model | Base | Status |
|---|---|---|
| Reflect 27B | Qwen3.6-27B | โ Released |
| Reflect 9B | Qwen3.5-9B | Coming soon |
| Reflect 4B | Qwen3.5-4B | Coming soon |
All three sizes trained on the same 1,400 examples with the same methodology. One voice, three scales.
Recommended System Prompt
Recommended Settings
- Temperature: 0.6-0.7
- Context: Up to 262K tokens supported
- Quantization: Q6_K
Technical Details
- Base model: Qwen/Qwen3.6-27B
- Architecture: Dense transformer, 27B parameters
- Format: GGUF Q6_K
- File size: ~22GB
- Training hardware: RTX Pro6000
- Training framework: Unsloth
About
Built by some random guy
The core insight: model quality is determined more by dataset curation than by parameter count or training compute. 1,400 carefully chosen examples outperform thousands of uncurated ones.
License
Same as base model (Apache 2.0 / Qwen license).
Links
- Base model: Qwen/Qwen3.6-27B
- Downloads last month
- 52
6-bit
Model tree for annnnnnnd/Qwen3.6-27B-Reflect
Base model
Qwen/Qwen3.6-27B