# Qwen3.5-27B: RYS Layer Surgery (GGUF)

Two modified versions of Qwen3.5-27B produced by RYS layer duplication: no training, no weight changes, just routing hidden states through a specific circuit twice.

Based on David Ng's RYS method.


## Files

| File | Layers | Size |
|---|---|---|
| Qwen3.5-27B-UD-Q4_K_XL.gguf | 64 | 17 GiB |
| Qwen3.5-27B-rys_30-33-UD-Q4_K_XL.gguf | 68 | 21 GiB |
| Qwen3.5-27B-rys_34-37_eq-UD-Q4_K_XL.gguf | 68 | 21 GiB |
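The files can also be fetched programmatically with `huggingface_hub` (a minimal sketch using this repo's id; adjust if you mirror the files elsewhere):

```python
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the path.
path = hf_hub_download(
    repo_id="XpressAI/Qwen3.5-27B-RYS-UD-Q4_K_XL-GGUF",
    filename="Qwen3.5-27B-rys_30-33-UD-Q4_K_XL.gguf",
)
print(path)
```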

## Probe scores

Scores from an internal sweep benchmark run during circuit search. Sample sizes are small, so treat these as directional indicators, not definitive benchmarks.

| Model | Math | EQ | Reasoning | Logic |
|---|---|---|---|---|
| Base (64 layers) | 0.375 | 11.5 | 0.000 | 0.00 |
| rys_30-33 (68 layers) | 0.438 | 29.5 | 0.353 | 1.00 |
| rys_34-37 (68 layers) | 0.375 | 39.4 | 0.000 | 0.00 |

- **Math**: Ng's partial-credit scoring on a small GSM8K sample
- **EQ**: EQ-Bench-style emotional intelligence score (0–100)
- **Reasoning**: fraction correct across causal, date, logic, navigation, and GSM8K probes
- **Logic**: fraction correct on logical deduction probes only

rys_30-33 shows the best combined improvement across reasoning categories. rys_34-37 shows the highest EQ score but no reasoning improvement over baseline.


## Benchmarks (based on BFCLv4)

The tables below label the RYS model "Qwen3.5-27B-RYS-30-34", using the sweep's half-open range notation for the rys_30-33 build.

### Non-Live Tests

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| irrelevance | 86.67% (-1.25%) | 87.50% | 85.83% | 87.92% | 85.42% | 77.50% | 80.00% |
| multiple | 96.50% | 96.50% | 95.50% | 95.50% | 95.00% | 92.50% | 88.00% |
| parallel | 95.00% | 93.00% | 93.50% | 94.50% | 91.50% | 88.50% | 89.00% |
| parallel_multiple | 91.50% (-0.50%) | 76.00% | 88.50% | 92.00% | 89.50% | 87.00% | 77.50% |
| simple_java | 62.00% (-3.00%) | 65.00% | 60.00% | 62.00% | 64.00% | 62.00% | 62.00% |
| simple_javascript | 72.00% (-2.00%) | 66.00% | 74.00% | 58.00% | 64.00% | 66.00% | 64.00% |
| simple_python | 95.25% (-2.50%) | 95.00% | 96.50% | 97.75% | 94.75% | 92.50% | 92.75% |

### Live Tests

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| live_irrelevance | 82.24% (-3.05%) | 80.88% | 83.60% | 85.29% | 84.50% | 73.30% | 78.85% |
| live_multiple | 79.68% (-1.14%) | 80.82% | 78.16% | 78.92% | 78.92% | 73.88% | 70.37% |
| live_parallel | 81.25% (-6.25%) | 87.50% | 87.50% | 87.50% | 81.25% | 75.00% | 68.75% |
| live_parallel_multiple | 75.00% (-8.33%) | 79.17% | 75.00% | 83.33% | 75.00% | 79.17% | 58.33% |
| live_relevance | 81.25% (-6.25%) | 68.75% | 62.50% | 68.75% | 75.00% | 87.50% | 75.00% |
| live_simple | 84.50% (-5.03%) | 87.60% | 86.43% | 89.53% | 89.53% | 82.17% | 71.71% |

### Multi-Turn Tests

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| multi_turn_base | 74.50% (-6.50%) | 70.50% | 81.00% | 69.00% | 74.50% | 44.00% | 36.50% |
| multi_turn_long_context | 67.50% (-3.00%) | 59.00% | 70.50% | 59.00% | 66.50% | 44.00% | 30.50% |

### Memory Tests (Agentic)

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| memory_kv | 45.81% (-25.16%) | N/A | 70.97% | 54.19% | 43.87% | 57.42% | 33.55% |
| memory_rec_sum | 70.97% (-12.26%) | N/A | 77.42% | 83.23% | 67.10% | 51.61% | 60.65% |
| memory_vector | 63.23% (-9.67%) | N/A | 72.90% | 57.42% | 56.13% | 58.71% | 43.23% |

### RYS vs Baseline Comparison (All Tests)

| Task | RYS | Baseline | Δ (RYS - Baseline) |
|---|---|---|---|
| irrelevance | 86.67% | 87.50% | -0.83% |
| multiple | 96.50% | 96.50% | 0.00% |
| parallel | 95.00% | 93.00% | +2.00% ✅ |
| parallel_multiple | 91.50% | 76.00% | +15.50% ✅ |
| simple_java | 62.00% | 65.00% | -3.00% |
| simple_javascript | 72.00% | 66.00% | +6.00% ✅ |
| simple_python | 95.25% | 95.00% | +0.25% |
| live_irrelevance | 82.24% | 80.88% | +1.36% ✅ |
| live_multiple | 79.68% | 80.82% | -1.14% |
| live_parallel | 81.25% | 87.50% | -6.25% |
| live_parallel_multiple | 75.00% | 79.17% | -4.17% |
| live_relevance | 81.25% | 68.75% | +12.50% ✅ |
| live_simple | 84.50% | 87.60% | -3.10% |
| multi_turn_base | 74.50% | 70.50% | +4.00% ✅ |
| multi_turn_long_context | 67.50% | 59.00% | +8.50% ✅ |
| memory_kv | 45.81% | N/A | ✅ |
| memory_rec_sum | 70.97% | N/A | ✅ |
| memory_vector | 63.23% | N/A | ✅ |

## What is RYS?

Transformers self-organise during training into functional circuits: contiguous blocks of layers that act together. The RYS technique duplicates a specific block in the forward pass using the same weights, with no extra copies on disk beyond the GGUF file overhead:

```
Normal:     0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 34 → … → 63
rys_30-33:  0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 30 → 31 → 32 → 33 → 34 → … → 63
```

The model processes the same circuit twice, without any weight changes or fine-tuning.
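In plain Python, the routing change looks like the sketch below (illustrative only, not llama.cpp internals; `layers` stands in for the model's transformer blocks):

```python
def rys_layer_order(n_layers: int, start: int, end: int) -> list[int]:
    """Forward-pass layer indices with the inclusive block [start, end] run twice."""
    order = list(range(start))                # 0 .. start-1
    order += list(range(start, end + 1)) * 2  # the duplicated circuit
    order += list(range(end + 1, n_layers))   # end+1 .. n_layers-1
    return order

def rys_forward(hidden, layers, start=30, end=33):
    # Same weights every time an index repeats: no copies, no fine-tuning.
    for i in rys_layer_order(len(layers), start, end):
        hidden = layers[i](hidden)
    return hidden

assert len(rys_layer_order(64, 30, 33)) == 68  # matches the rys_30-33 diagram above
```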


## Hybrid Mamba/attention architecture constraint

Qwen3.5-27B is a hybrid SSM/attention model (`full_attention_interval = 4`): full attention every 4th layer, Mamba SSM everywhere else.

This creates a hard constraint on layer surgery: the total layer count must remain divisible by 4.

- Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)
- Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗, server crash at load)
- Block size 8 → 64 + 8 = 72 layers (72 ÷ 4 = 18 ✓)

Only multiples of 4 work as block sizes for this model family.
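A candidate block size can be sanity-checked before patching (a small sketch; `full_attention_interval` is the config value described above):

```python
def block_size_loads(n_layers: int, block: int, full_attention_interval: int = 4) -> bool:
    """A duplicated block only loads if the new total stays divisible by the interval."""
    return (n_layers + block) % full_attention_interval == 0

assert block_size_loads(64, 4)       # 68 layers: loads
assert not block_size_loads(64, 3)   # 67 layers: crashes at load
assert block_size_loads(64, 8)       # 72 layers: loads
```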


## How the circuit was found

A two-pass sweep over the 64-layer model using a probe benchmark:

**Pass 1** (8-layer blocks, stride 4, layers 4–60):

- Identified hot zones at layers 8–16 (reasoning) and 28–40 (EQ/math)

**Pass 2** (4-layer blocks, stride 1, within each hot zone):

- (30, 34), i.e. layers 30-33 duplicated, achieved the best combined score: reasoning=0.353, EQ=29.5, logic=1.0
- (34, 38), i.e. layers 34-37 duplicated, achieved the highest EQ score: EQ=39.4

Each configuration was tested by patching the GGUF layer path, loading with llama-server, and scoring with the probe suite.
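In outline, the sweep is the loop below. `patch_gguf`, `launch_server`, and `run_probes` are hypothetical helper names standing in for the GGUF layer-path patching, llama-server load, and probe-suite scoring just described:

```python
def sweep(candidates, base_gguf="Qwen3.5-27B-UD-Q4_K_XL.gguf"):
    """Score every candidate duplication block (start, end) with the probe suite."""
    scores = {}
    for start, end in candidates:  # half-open: (30, 34) duplicates layers 30-33
        gguf = patch_gguf(base_gguf, duplicate=range(start, end))
        with launch_server(gguf) as server:
            scores[(start, end)] = run_probes(server)
    return scores

# Pass 1: 8-layer blocks, stride 4, over layers 4-60
pass1 = sweep([(s, s + 8) for s in range(4, 53, 4)])

# Pass 2: 4-layer blocks, stride 1, inside a hot zone (e.g. 28-40)
pass2 = sweep([(s, s + 4) for s in range(28, 37)])
```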


## Usage

### llama.cpp / llama-server

```bash
llama-server -m Qwen3.5-27B-rys_30-33-UD-Q4_K_XL.gguf -ngl 99 --port 8080
```

### Thinking mode

Qwen3.5 defaults to thinking mode (`<think>…</think>`). Add `/no_think` to the system prompt for fast, direct answers:

```python
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Your question here"},
]
```
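To send that request to the server started above, llama-server's OpenAI-compatible `/v1` endpoint works with the standard `openai` client (a sketch; the `model` field is a free-form label for llama-server, and the port matches the command above):

```python
from openai import OpenAI

# llama-server ignores the API key unless launched with --api-key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="qwen3.5-27b-rys_30-33",  # arbitrary label; the server serves whatever it loaded
    messages=[
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Your question here"},
    ],
)
print(resp.choices[0].message.content)
```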

### VRAM requirements

The model weights alone are ~21 GiB (Q4_K_XL quantization, 68 layers). A single A100 80GB or H100 runs this comfortably. Consumer GPU setups depend on your llama.cpp version's tensor split support.



## License

Apache 2.0 (inherited from the base model)
