How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="annnnnnnd/Qwen3.6-27B-Reflect",
	filename="",
)
llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Qwen3.6-27B-Reflect

A fine-tuned Qwen3.6-27B focused on anti-sycophancy, reasoning efficiency, and honest voice.

What is Reflect?

Reflect is a fine-tuned family built on the principle that less training data, better curated, produces superior results. Rather than training on tens of thousands of examples, Reflect uses 1,400 aggressively cleaned examples to reshape the model's voice without degrading its capabilities.

The name "Reflect" describes what the model does โ€” it reflects honestly instead of performing.

Key Results

  • 3x token efficiency vs base Qwen3.6-27B on equivalent reasoning tasks. Same accuracy, one-third the thinking tokens. The model reasons more efficiently because verbose padding was stripped during training.
  • Anti-sycophancy as efficiency: sycophantic patterns are processing overhead โ€” hedging, qualifying, self-doubting, over-praising. Stripping them doesn't just change the voice, it reduces wasted compute in the reasoning trace itself, reducing context pollution. The model thinks faster because it isn't trying to please.
  • Meta-cognition: allows the model to be correctable, not more correct. It still doesn't know what it doesn't know. Good prompting techniques also help โ€” think of the model as a baby who knows a lot.
  • Fully preserved tool use: native Qwen tool-calling capability retained. No degradation in function calling, structured output, or agent workflows.

Training Methodology

SFT (Supervised Fine-Tuning)

  • Dataset: 1,400 curated examples
  • LoRA config: r32 / a32 (1:1 alpha-to-rank ratio for stable training)
  • Learning rate: 1e-4
  • Epochs: 1
  • Precision: Q4 (forces reconstruction, cleaner reasoning)
  • Key principle: less is more.

DPO (Direct Preference Optimization)

  • 1400 preference pairs with further trimmed overhead
  • LoRA config: r16
  • Learning rate: 1e-6
  • Beta: 0.1
  • Epochs: 1
  • Method: Voice distillation using model's own output to correct voice imperfections and further instill correct reasoning path.

Benchmarks

Reflect vs Base Qwen3.6-27B (Q6_K)

Same hardware, same config, same seed, same samples. Clean A/B comparison. Thinking trace off to gauge base weight similarity.

Benchmark N Base Qwen3.6 Reflect Delta
MMLU 1000 87.40% 87.60% +0.20%
GSM8K 400 96.25% 96.75% +0.50%
HumanEval 164 93.29% 92.07% -1.22%
IFEval 192 81.25% 77.08% -4.17%
ARC Challenge 400 96.75% 96.25% -0.50%
TruthfulQA 200 89.50% 87.50% -2.00%
Average 90.74% 89.54% -1.20%
Wall time 2191.6s 2115.3s -3.5%

Key findings:

  • MMLU and GSM8K improved โ€” personality training slightly enhanced knowledge recall and math reasoning. This should not happen with 1,400 examples. It suggests the anti-sycophancy training reduces processing overhead, allowing the model to reason more directly.
  • IFEval dropped 4.17% โ€” this is the anti-sycophancy feature working. Reflect pushes back on instructions rather than blindly complying. This is not a regression; it's the intended behavior.
  • HumanEval, ARC, TruthfulQA within noise โ€” no catastrophic forgetting despite personality modification.
  • 3.5% faster wall time โ€” Reflect generates less verbose reasoning traces, translating to faster inference.

Token Efficiency โ€” Thinking Mode Retest

Both models were retested on the 215 questions they both failed in the initial (non-thinking) run. Thinking enabled, 3 samples per question, identical settings.

Time to complete:

Base Qwen3.6 Reflect Ratio
Total time 6595s (110 min) 2053s (34 min) 3.2x faster

Average response length (chars) per benchmark:

Benchmark N Base Qwen3.6 Reflect Ratio
MMLU 138 6047 1489 4.1x shorter
GSM8K 18 5731 364 15.7x shorter
ARC Challenge 16 6437 1408 4.6x shorter
TruthfulQA 28 1132 2382 2.1x longer
HumanEval 15 1116 733 1.5x shorter

Recovery rates (pass within 3 tries):

Benchmark Base Qwen3.6 Reflect
MMLU 46.4% 52.9%
GSM8K 61.1% 44.4%
ARC Challenge 50.0% 12.5%
TruthfulQA 46.4% 57.1%
HumanEval 60.0% 46.7%

Key insight: Reflect allocates thinking tokens where they matter. It spends 2x more on TruthfulQA (where careful reasoning about honesty is valuable) while spending 15.7x less on GSM8K (where direct math reasoning doesn't need verbose self-narration). This isn't uniform compression โ€” it's intelligent reallocation of processing budget.

The anti-sycophancy training didn't just strip output padding. It reshaped the model's internal reasoning economy.

Adjusted Final Scores (Initial + Thinking Recovery)

Combined scores after both models attempted to recover their shared 215 failures with thinking enabled.

Benchmark Base Qwen3.6 Reflect Delta
MMLU 93.8% 94.9% +1.1%
GSM8K 99.0% 98.8% -0.2%
HumanEval 98.8% 96.3% -2.5%
IFEval 81.3% 77.1% -4.2%
ARC Challenge 98.8% 96.8% -2.0%
TruthfulQA 96.0% 95.5% -0.5%
Average 94.6% 93.2% -1.4%

Both models recovered nearly identical numbers of failed questions (~105 vs ~106 out of 215). The 1.4% gap is almost entirely from IFEval (anti-sycophancy working as designed). Excluding IFEval, the capability gap is under 1%.

Same recovery. 3.2x faster. 1,400 examples.

The Reflect Family

Model Base Status
Reflect 27B Qwen3.6-27B โœ… Released
Reflect 9B Qwen3.5-9B Coming soon
Reflect 4B Qwen3.5-4B Coming soon

All three sizes trained on the same 1,400 examples with the same methodology. One voice, three scales.

Recommended System Prompt


Recommended Settings

  • Temperature: 0.6-0.7
  • Context: Up to 262K tokens supported
  • Quantization: Q6_K

Technical Details

  • Base model: Qwen/Qwen3.6-27B
  • Architecture: Dense transformer, 27B parameters
  • Format: GGUF Q6_K
  • File size: ~22GB
  • Training hardware: RTX Pro6000
  • Training framework: Unsloth

About

Built by some random guy

The core insight: model quality is determined more by dataset curation than by parameter count or training compute. 1,400 carefully chosen examples outperform thousands of uncurated ones.

License

Same as base model (Apache 2.0 / Qwen license).

Links

Downloads last month
52
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for annnnnnnd/Qwen3.6-27B-Reflect

Base model

Qwen/Qwen3.6-27B
Quantized
(400)
this model