# Qwen3.5-27B RYS Layer Surgery (GGUF)

Two modified versions of Qwen3.5-27B produced by RYS layer duplication: no training, no weight changes, just routing hidden states through a specific circuit twice.

Based on David Ng's RYS method.
## Files

| File | Layers | Size |
|---|---|---|
| Qwen3.5-27B-UD-Q4_K_XL.gguf | 64 | 17 GiB |
| Qwen3.5-27B-rys_30-33-UD-Q4_K_XL.gguf | 68 | 21 GiB |
| Qwen3.5-27B-rys_34-37_eq-UD-Q4_K_XL.gguf | 68 | 21 GiB |
## Probe scores

Scores from an internal sweep benchmark run during circuit search. Sample sizes are small, so treat these as directional indicators, not definitive benchmarks.
| Model | Math | EQ | Reasoning | Logic |
|---|---|---|---|---|
| Base (64 layers) | 0.375 | 11.5 | 0.000 | 0.00 |
| rys_30-33 (68 layers) | 0.438 | 29.5 | 0.353 | 1.00 |
| rys_34-37 (68 layers) | 0.375 | 39.4 | 0.000 | 0.00 |
- Math: Ng's partial-credit scoring on a small GSM8K sample
- EQ: EQ-Bench-style emotional intelligence score (0–100)
- Reasoning: fraction correct across causal, date, logic, navigation, and GSM8K probes
- Logic: fraction correct on logical deduction probes only
rys_30-33 shows the best combined improvement across reasoning categories. rys_34-37 shows the highest EQ score but no reasoning improvement over baseline.
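As a concrete illustration of the Reasoning metric, it is simply the fraction of correct answers across the probe families listed above. The pass/fail values below are made up for illustration:

```python
# Illustrative only: the Reasoning score is the fraction of probes answered
# correctly. These 1/0 outcomes are invented, not real run results.
probe_results = {"causal": 1, "date": 0, "logic": 1, "navigation": 0, "gsm8k": 1}
reasoning = sum(probe_results.values()) / len(probe_results)
print(reasoning)  # 0.6
```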
## Benchmarks (based on BFCLv4)

### Non-Live Tests
| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| irrelevance | 86.67% (-1.25%) | 87.50% | 85.83% | 87.92% | 85.42% | 77.50% | 80.00% |
| multiple | 96.50% | 96.50% | 95.50% | 95.50% | 95.00% | 92.50% | 88.00% |
| parallel | 95.00% | 93.00% | 93.50% | 94.50% | 91.50% | 88.50% | 89.00% |
| parallel_multiple | 91.50% (-0.50%) | 76.00% | 88.50% | 92.00% | 89.50% | 87.00% | 77.50% |
| simple_java | 62.00% (-3.00%) | 65.00% | 60.00% | 62.00% | 64.00% | 62.00% | 62.00% |
| simple_javascript | 72.00% (-2.00%) | 66.00% | 74.00% | 58.00% | 64.00% | 66.00% | 64.00% |
| simple_python | 95.25% (-2.50%) | 95.00% | 96.50% | 97.75% | 94.75% | 92.50% | 92.75% |
### Live Tests

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| live_irrelevance | 82.24% (-3.05%) | 80.88% | 83.60% | 85.29% | 84.50% | 73.30% | 78.85% |
| live_multiple | 79.68% (-1.14%) | 80.82% | 78.16% | 78.92% | 78.92% | 73.88% | 70.37% |
| live_parallel | 81.25% (-6.25%) | 87.50% | 87.50% | 87.50% | 81.25% | 75.00% | 68.75% |
| live_parallel_multiple | 75.00% (-8.33%) | 79.17% | 75.00% | 83.33% | 75.00% | 79.17% | 58.33% |
| live_relevance | 81.25% (-6.25%) | 68.75% | 62.50% | 68.75% | 75.00% | 87.50% | 75.00% |
| live_simple | 84.50% (-5.03%) | 87.60% | 86.43% | 89.53% | 89.53% | 82.17% | 71.71% |
### Multi-Turn Tests

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| multi_turn_base | 74.50% (-6.50%) | 70.50% | 81.00% | 69.00% | 74.50% | 44.00% | 36.50% |
| multi_turn_long_context | 67.50% (-3.00%) | 59.00% | 70.50% | 59.00% | 66.50% | 44.00% | 30.50% |
### Memory Tests (Agentic)

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| memory_kv | 45.81% (-25.16%) | N/A | 70.97% | 54.19% | 43.87% | 57.42% | 33.55% |
| memory_rec_sum | 70.97% (-12.26%) | N/A | 77.42% | 83.23% | 67.10% | 51.61% | 60.65% |
| memory_vector | 63.23% (-9.67%) | N/A | 72.90% | 57.42% | 56.13% | 58.71% | 43.23% |
### RYS vs Baseline Comparison (All Tests)

| Task | RYS | Baseline | Δ (RYS - Baseline) |
|---|---|---|---|
| irrelevance | 86.67% | 87.50% | -0.83% |
| multiple | 96.50% | 96.50% | 0.00% |
| parallel | 95.00% | 93.00% | +2.00% ✓ |
| parallel_multiple | 91.50% | 76.00% | +15.50% ✓ |
| simple_java | 62.00% | 65.00% | -3.00% |
| simple_javascript | 72.00% | 66.00% | +6.00% ✓ |
| simple_python | 95.25% | 95.00% | +0.25% |
| live_irrelevance | 82.24% | 80.88% | +1.36% ✓ |
| live_multiple | 79.68% | 80.82% | -1.14% |
| live_parallel | 81.25% | 87.50% | -6.25% |
| live_parallel_multiple | 75.00% | 79.17% | -4.17% |
| live_relevance | 81.25% | 68.75% | +12.50% ✓ |
| live_simple | 84.50% | 87.60% | -3.10% |
| multi_turn_base | 74.50% | 70.50% | +4.00% ✓ |
| multi_turn_long_context | 67.50% | 59.00% | +8.50% ✓ |
| memory_kv | 45.81% | N/A | N/A |
| memory_rec_sum | 70.97% | N/A | N/A |
| memory_vector | 63.23% | N/A | N/A |
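The Δ column above is plain subtraction of the baseline score from the RYS score, e.g. for the parallel_multiple row:

```python
# Δ = RYS score minus baseline score (parallel_multiple row above)
rys, baseline = 91.50, 76.00
print(f"{rys - baseline:+.2f}%")  # +15.50%
```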
## What is RYS?

Transformers self-organise during training into functional circuits: contiguous blocks of layers that act together. The RYS technique duplicates a specific block in the forward pass using the same weights, with no extra copies on disk beyond the GGUF file overhead:

    Normal:    0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 34 → … → 63
    rys_30-33: 0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 30 → 31 → 32 → 33 → 34 → … → 63
The model processes the same circuit twice, without any weight changes or fine-tuning.
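The replay logic can be sketched as a simple forward-pass loop. This is a toy illustration with callables standing in for layers, not the actual llama.cpp graph construction:

```python
def forward_with_rys(layers, x, dup_start, dup_end):
    """Run x through every layer, replaying layers[dup_start..dup_end]
    once with the same weights, as in rys_30-33."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == dup_end:
            # second pass through the same circuit, no weight changes
            for j in range(dup_start, dup_end + 1):
                x = layers[j](x)
    return x

# toy "layers" that record their index so the routing order is visible
trace = []
layers = [lambda x, i=i: (trace.append(i), x)[1] for i in range(8)]
forward_with_rys(layers, 0, dup_start=3, dup_end=5)
print(trace)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```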
## Hybrid Mamba/attention architecture constraint

Qwen3.5-27B is a hybrid SSM/attention model (full_attention_interval = 4): full attention every 4th layer, Mamba SSM everywhere else.

This creates a hard constraint on layer surgery: the total layer count must remain divisible by 4.

- Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)
- Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗, server crash at load)
- Block size 8 → 64 + 8 = 72 layers (72 ÷ 4 = 18 ✓)
Only multiples of 4 work as block sizes for this model family.
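The constraint reduces to a one-line check, sketched here under the assumptions stated above (base count 64, interval 4):

```python
# A duplicated block is loadable only if the resulting layer count stays
# divisible by full_attention_interval (4 for Qwen3.5-27B, per the text).
def rys_block_is_valid(base_layers, block_size, attention_interval=4):
    return (base_layers + block_size) % attention_interval == 0

print(rys_block_is_valid(64, 4))  # True  -> 68 layers
print(rys_block_is_valid(64, 3))  # False -> 67 layers, crashes at load
print(rys_block_is_valid(64, 8))  # True  -> 72 layers
```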
## How the circuit was found

A two-pass sweep over the 64-layer model using a probe benchmark:

Pass 1 (8-layer blocks, stride 4, layers 4–60):

- Identified hot zones at layers 8–16 (reasoning) and 28–40 (EQ/math)

Pass 2 (4-layer blocks, stride 1, within each hot zone):

- (30, 34) achieved the best combined score: reasoning=0.353, EQ=29.5, logic=1.0
- (34, 38) achieved the highest EQ score: EQ=39.4
Each configuration was tested by patching the GGUF layer path, loading with llama-server, and scoring with the probe suite.
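The sweep itself is a plain grid search. In this sketch, `score_fn` is an assumed placeholder for the patch-GGUF / llama-server / probe-suite loop described above:

```python
def sweep(score_fn, block=8, stride=4, lo=4, hi=60):
    """Score every candidate duplicated block [start, start+block)
    at the given stride; mirrors the Pass 1 parameters above."""
    return {(s, s + block): score_fn(s, s + block)
            for s in range(lo, hi - block + 1, stride)}

# with a dummy scorer, just to show the Pass 1 candidate grid
configs = sweep(lambda a, b: 0.0)
print(sorted(configs)[:3])  # [(4, 12), (8, 16), (12, 20)]
```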
## Usage

### llama.cpp / llama-server

    llama-server -m Qwen3.5-27B-rys_30-33.gguf -ngl 99 --port 8080
### Thinking mode

Qwen3.5 defaults to thinking mode (`<think>…</think>`). Add `/no_think` to the system prompt for fast, direct answers:

    messages = [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Your question here"},
    ]
### VRAM requirements

The model weights alone are ~21 GiB (Q4_K_XL quantization, 68 layers). A single A100 80GB or H100 runs this comfortably. Consumer GPU setups depend on your llama.cpp version's tensor split support.
## Credits

- David Ng for the original RYS method
- Unsloth for the base Q4_K_XL GGUF quantization
- Qwen team for Qwen3.5-27B
- llama.cpp for local inference
## License

Apache 2.0 (inherited from base model)