# Qwen3.5-27B RYS Layer Surgery (GGUF)

Two modified versions of Qwen3.5-27B produced by RYS layer duplication: no training, no weight changes, just routing hidden states through a specific circuit twice.

Based on David Ng's RYS method.
## Files

| File | Layers | Size |
|---|---|---|
| Qwen3.5-27B-UD-Q4_K_XL.gguf | 64 | 17 GiB |
| Qwen3.5-27B-rys_30-33-UD-Q4_K_XL.gguf | 68 | 21 GiB |
| Qwen3.5-27B-rys_34-37_eq-UD-Q4_K_XL.gguf | 68 | 21 GiB |
## Probe scores

Scores from an internal sweep benchmark run during circuit search. Sample sizes are small, so treat these as directional indicators, not definitive benchmarks.
| Model | Math | EQ | Reasoning | Logic |
|---|---|---|---|---|
| Base (64 layers) | 0.375 | 11.5 | 0.000 | 0.00 |
| rys_30-33 (68 layers) | 0.438 | 29.5 | 0.353 | 1.00 |
| rys_34-37 (68 layers) | 0.375 | 39.4 | 0.000 | 0.00 |
- Math: Ng's partial-credit scoring on a small GSM8K sample
- EQ: EQ-Bench-style emotional intelligence score (0–100)
- Reasoning: fraction correct across causal, date, logic, navigation, and GSM8K probes
- Logic: fraction correct on logical deduction probes only
rys_30-33 shows the best combined improvement across reasoning categories. rys_34-37 shows the highest EQ score but no reasoning improvement over baseline.
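As a concrete illustration of the Reasoning metric, it is simply the fraction of correct answers across the probe families listed above. The pass/fail values below are made up for illustration:

```python
# Illustrative only: the Reasoning score is the fraction of probes answered
# correctly. These 1/0 outcomes are invented, not real run results.
probe_results = {"causal": 1, "date": 0, "logic": 1, "navigation": 0, "gsm8k": 1}
reasoning = sum(probe_results.values()) / len(probe_results)
print(reasoning)  # 0.6
```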
## Benchmarks (based on BFCLv4)

### Non-Live Tests
| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| irrelevance | 86.67% (-1.25%) | 87.50% | 85.83% | 87.92% | 85.42% | 77.50% | 80.00% |
| multiple | 96.50% | 96.50% | 95.50% | 95.50% | 95.00% | 92.50% | 88.00% |
| parallel | 95.00% | 93.00% | 93.50% | 94.50% | 91.50% | 88.50% | 89.00% |
| parallel_multiple | 91.50% (-0.50%) | 76.00% | 88.50% | 92.00% | 89.50% | 87.00% | 77.50% |
| simple_java | 62.00% (-3.00%) | 65.00% | 60.00% | 62.00% | 64.00% | 62.00% | 62.00% |
| simple_javascript | 72.00% (-2.00%) | 66.00% | 74.00% | 58.00% | 64.00% | 66.00% | 64.00% |
| simple_python | 95.25% (-2.50%) | 95.00% | 96.50% | 97.75% | 94.75% | 92.50% | 92.75% |
### Live Tests

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| live_irrelevance | 82.24% (-3.05%) | 80.88% | 83.60% | 85.29% | 84.50% | 73.30% | 78.85% |
| live_multiple | 79.68% (-1.14%) | 80.82% | 78.16% | 78.92% | 78.92% | 73.88% | 70.37% |
| live_parallel | 81.25% (-6.25%) | 87.50% | 87.50% | 87.50% | 81.25% | 75.00% | 68.75% |
| live_parallel_multiple | 75.00% (-8.33%) | 79.17% | 75.00% | 83.33% | 75.00% | 79.17% | 58.33% |
| live_relevance | 81.25% (-6.25%) | 68.75% | 62.50% | 68.75% | 75.00% | 87.50% | 75.00% |
| live_simple | 84.50% (-5.03%) | 87.60% | 86.43% | 89.53% | 89.53% | 82.17% | 71.71% |
### Multi-Turn Tests

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| multi_turn_base | 74.50% (-6.50%) | 70.50% | 81.00% | 69.00% | 74.50% | 44.00% | 36.50% |
| multi_turn_long_context | 67.50% (-3.00%) | 59.00% | 70.50% | 59.00% | 66.50% | 44.00% | 30.50% |
### Memory Tests (Agentic)

| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |
|---|---|---|---|---|---|---|---|
| memory_kv | 45.81% (-25.16%) | N/A | 70.97% | 54.19% | 43.87% | 57.42% | 33.55% |
| memory_rec_sum | 70.97% (-12.26%) | N/A | 77.42% | 83.23% | 67.10% | 51.61% | 60.65% |
| memory_vector | 63.23% (-9.67%) | N/A | 72.90% | 57.42% | 56.13% | 58.71% | 43.23% |
### RYS vs Baseline Comparison (All Tests)

| Task | RYS | Baseline | Δ (RYS - Baseline) |
|---|---|---|---|
| irrelevance | 86.67% | 87.50% | -0.83% |
| multiple | 96.50% | 96.50% | 0.00% |
| parallel | 95.00% | 93.00% | +2.00% ✓ |
| parallel_multiple | 91.50% | 76.00% | +15.50% ✓ |
| simple_java | 62.00% | 65.00% | -3.00% |
| simple_javascript | 72.00% | 66.00% | +6.00% ✓ |
| simple_python | 95.25% | 95.00% | +0.25% |
| live_irrelevance | 82.24% | 80.88% | +1.36% ✓ |
| live_multiple | 79.68% | 80.82% | -1.14% |
| live_parallel | 81.25% | 87.50% | -6.25% |
| live_parallel_multiple | 75.00% | 79.17% | -4.17% |
| live_relevance | 81.25% | 68.75% | +12.50% ✓ |
| live_simple | 84.50% | 87.60% | -3.10% |
| multi_turn_base | 74.50% | 70.50% | +4.00% ✓ |
| multi_turn_long_context | 67.50% | 59.00% | +8.50% ✓ |
| memory_kv | 45.81% | N/A | N/A |
| memory_rec_sum | 70.97% | N/A | N/A |
| memory_vector | 63.23% | N/A | N/A |
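The Δ column above is plain subtraction of the baseline score from the RYS score, e.g. for the parallel_multiple row:

```python
# Δ = RYS score minus baseline score (parallel_multiple row above)
rys, baseline = 91.50, 76.00
print(f"{rys - baseline:+.2f}%")  # +15.50%
```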
## What is RYS?

Transformers self-organise during training into functional circuits: contiguous blocks of layers that act together. The RYS technique duplicates a specific block in the forward pass using the same weights, with no extra copies on disk beyond the GGUF file overhead:

    Normal:    0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 34 → … → 63
    rys_30-33: 0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 30 → 31 → 32 → 33 → 34 → … → 63
The model processes the same circuit twice, without any weight changes or fine-tuning.
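The replay logic can be sketched as a simple forward-pass loop. This is a toy illustration with callables standing in for layers, not the actual llama.cpp graph construction:

```python
def forward_with_rys(layers, x, dup_start, dup_end):
    """Run x through every layer, replaying layers[dup_start..dup_end]
    once with the same weights, as in rys_30-33."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == dup_end:
            # second pass through the same circuit, no weight changes
            for j in range(dup_start, dup_end + 1):
                x = layers[j](x)
    return x

# toy "layers" that record their index so the routing order is visible
trace = []
layers = [lambda x, i=i: (trace.append(i), x)[1] for i in range(8)]
forward_with_rys(layers, 0, dup_start=3, dup_end=5)
print(trace)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```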
## Hybrid Mamba/attention architecture constraint

Qwen3.5-27B is a hybrid SSM/attention model (full_attention_interval = 4): full attention every 4th layer, Mamba SSM everywhere else.

This creates a hard constraint on layer surgery: the total layer count must remain divisible by 4.

- Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)
- Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗, server crash at load)
- Block size 8 → 64 + 8 = 72 layers (72 ÷ 4 = 18 ✓)
Only multiples of 4 work as block sizes for this model family.
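The constraint reduces to a one-line check, sketched here under the assumptions stated above (base count 64, interval 4):

```python
# A duplicated block is loadable only if the resulting layer count stays
# divisible by full_attention_interval (4 for Qwen3.5-27B, per the text).
def rys_block_is_valid(base_layers, block_size, attention_interval=4):
    return (base_layers + block_size) % attention_interval == 0

print(rys_block_is_valid(64, 4))  # True  -> 68 layers
print(rys_block_is_valid(64, 3))  # False -> 67 layers, crashes at load
print(rys_block_is_valid(64, 8))  # True  -> 72 layers
```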
## How the circuit was found

A two-pass sweep over the 64-layer model using a probe benchmark:

Pass 1 (8-layer blocks, stride 4, layers 4–60):

- Identified hot zones at layers 8–16 (reasoning) and 28–40 (EQ/math)

Pass 2 (4-layer blocks, stride 1, within each hot zone):

- (30, 34) achieved the best combined score: reasoning=0.353, EQ=29.5, logic=1.0
- (34, 38) achieved the highest EQ score: EQ=39.4
Each configuration was tested by patching the GGUF layer path, loading with llama-server, and scoring with the probe suite.
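The sweep itself is a plain grid search. In this sketch, `score_fn` is an assumed placeholder for the patch-GGUF / llama-server / probe-suite loop described above:

```python
def sweep(score_fn, block=8, stride=4, lo=4, hi=60):
    """Score every candidate duplicated block [start, start+block)
    at the given stride; mirrors the Pass 1 parameters above."""
    return {(s, s + block): score_fn(s, s + block)
            for s in range(lo, hi - block + 1, stride)}

# with a dummy scorer, just to show the Pass 1 candidate grid
configs = sweep(lambda a, b: 0.0)
print(sorted(configs)[:3])  # [(4, 12), (8, 16), (12, 20)]
```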
## Usage

### llama.cpp / llama-server

    llama-server -m Qwen3.5-27B-rys_30-33.gguf -ngl 99 --port 8080
### Thinking mode

Qwen3.5 defaults to thinking mode (`<think>…</think>`). Add `/no_think` to the system prompt for fast, direct answers:

    messages = [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Your question here"},
    ]
### VRAM requirements

The model weights alone are ~21 GiB (Q4_K_XL quantization, 68 layers). A single A100 80GB or H100 runs this comfortably. Consumer GPU setups depend on your llama.cpp version's tensor split support.
## Credits

- David Ng for the original RYS method
- Unsloth for the base Q4_K_XL GGUF quantization
- Qwen team for Qwen3.5-27B
- llama.cpp for local inference
## License

Apache 2.0 (inherited from base model)