Qwen3.5-9B-RYS-0-4 (GGUF)

An experimental model created by duplicating layers 0-3 in Qwen3.5-9B using the RYS (Repeat Your Self) method.

What This Is

The base Qwen3.5-9B model (32 layers) with layers 0-3 physically duplicated, creating a 36-layer model. The forward pass executes: layers 0,1,2,3 → 0,1,2,3 (again) → 4,5,...,31.

No training, no fine-tuning, no weight changes — just architectural routing.
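The routing can be sketched as a plain index list (a minimal illustration of the execution order described above, not the actual llama.cpp graph construction):

```python
# Sketch of the RYS layer routing: layers 0-3 run twice, then 4-31 once.
BASE_LAYERS = 32
DUP_BLOCK = range(0, 4)  # layers 0-3, the duplicated block

# Execution order of the modified forward pass.
path = list(DUP_BLOCK) + list(DUP_BLOCK) + list(range(4, BASE_LAYERS))

print(len(path))   # 36 layers total
print(path[:10])   # [0, 1, 2, 3, 0, 1, 2, 3, 4, 5]
```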

Method

Built using llm-circuit-finder:

python layer_path.py Qwen3.5-9B-Q4_K_M.gguf \
    Qwen3.5-9B-RYS-0-4.gguf \
    -p "0..3,0,1,2,3,4..31" -v

Architecture Discovery

Qwen 3.5 uses a hybrid [DeltaNet, DeltaNet, DeltaNet, Attention] repeating pattern (4-layer cycles). Duplication blocks must therefore be a multiple of 4 layers to preserve this pattern — a block size of 3 crashes with "missing tensor" errors.

This is the first known application of the RYS method to a hybrid DeltaNet/Attention architecture.
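A small sketch of why only multiples of 4 work, assuming the 4-layer type cycle described above: a duplicated block keeps every position aligned with the expected layer type only when the block length is a multiple of the cycle.

```python
# Qwen 3.5's repeating layer-type cycle (per the architecture note above).
PATTERN = ["DeltaNet", "DeltaNet", "DeltaNet", "Attention"]

def layer_type(i: int) -> str:
    return PATTERN[i % 4]

def preserves_pattern(path: list[int]) -> bool:
    """True if every position in the new path carries the layer type
    the 4-layer cycle expects at that position."""
    return all(layer_type(layer) == PATTERN[pos % 4]
               for pos, layer in enumerate(path))

ok_path  = [0, 1, 2, 3] * 2 + list(range(4, 32))  # 4-layer block duplicated
bad_path = [0, 1, 2] * 2 + list(range(3, 32))     # 3-layer block duplicated

print(preserves_pattern(ok_path))   # True
print(preserves_pattern(bad_path))  # False: position 3 expects Attention
```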

Fair Evaluation Results (max_tokens=4096, no think-tag stripping)

| Model | Code Gen | Hallucination Resistance | Reasoning | Overall |
|---|---|---|---|---|
| Baseline (32 layers) | 80% | 40% | 100% | 73.3% |
| RYS (0,4) (36 layers) | 60% | 80% | 100% | 80.0% |

Tested with 15 questions (5 code generation, 5 hallucination detection, 5 reasoning). Both models evaluated under identical conditions.

Key finding: The improvement is primarily in hallucination resistance (+40 percentage points). Code generation shows a tradeoff (-20 percentage points). Reasoning is unchanged.
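The overall scores above are the unweighted mean of the three category scores:

```python
# Overall = mean of (code gen, hallucination resistance, reasoning).
def overall(code: float, halluc: float, reasoning: float) -> float:
    return round((code + halluc + reasoning) / 3, 1)

print(overall(80, 40, 100))  # 73.3 (baseline)
print(overall(60, 80, 100))  # 80.0 (RYS 0-4)
```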

Important caveats:

  • Sample size is small (5 questions per category) — results need validation with larger benchmarks
  • The improvement may stem from changes in response-generation behavior rather than from genuine capability differences
  • The original RYS method was validated on standard transformer architectures (72B+ models); Qwen 3.5's hybrid DeltaNet architecture is untested territory
  • Side-by-side testing on harder hallucination prompts showed identical responses from both models

Circuit Map of Qwen 3.5-9B

Full sweep results (7 configurations tested):

| Config | Code | Hallucination | Reasoning | Overall | vs Baseline |
|---|---|---|---|---|---|
| Baseline | 80% | 40% | 100% | 73.3% | |
| (0,4) | 60% | 80% | 100% | 80.0% | +6.67% |
| (4,8) | 80% | 60% | 80% | 73.3% | +0.00% |
| (8,12) | 0% | 40% | 80% | 40.0% | -33.33% |
| (12,16) | 0% | 60% | 80% | 46.7% | -26.67% |
| (16,20) | 0% | 40% | 100% | 46.7% | -26.67% |
| (20,24) | 60% | 60% | 100% | 73.3% | +0.00% |
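The Overall and vs-Baseline columns can be reproduced from the raw category scores:

```python
# Category scores (code, hallucination, reasoning) per duplicated block.
BASELINE = (80 + 40 + 100) / 3  # 73.33
sweep = {
    "(0,4)":   (60, 80, 100),
    "(4,8)":   (80, 60, 80),
    "(8,12)":  (0, 40, 80),
    "(12,16)": (0, 60, 80),
    "(16,20)": (0, 40, 100),
    "(20,24)": (60, 60, 100),
}

# Rank configs by overall score, best first.
for name, scores in sorted(sweep.items(), key=lambda kv: -sum(kv[1])):
    total = sum(scores) / 3
    print(f"{name:8s} {total:5.1f} ({total - BASELINE:+.2f} vs baseline)")
```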

Usage

# With llama.cpp
llama-server -m Qwen3.5-9B-RYS-0-4.gguf -c 8192 -ngl 99

# With Ollama (create a Modelfile)
echo 'FROM ./Qwen3.5-9B-RYS-0-4.gguf' > Modelfile
ollama create qwen35-rys -f Modelfile
ollama run qwen35-rys

Specifications

  • Base model: Qwen/Qwen3.5-9B
  • Quantization: Q4_K_M
  • Original layers: 32
  • Modified layers: 36 (layers 0-3 duplicated)
  • File size: ~6.2 GB
  • Extra VRAM: ~0.5 GB over base model
  • Inference overhead: ~12% slower
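The ~12% slowdown is consistent with the extra per-token compute from four duplicated layers (a back-of-envelope check, ignoring non-layer costs like embedding and sampling):

```python
# Four extra layer executions on top of a 32-layer forward pass.
base_layers, extra_layers = 32, 4
extra_compute = extra_layers / base_layers
print(f"~{extra_compute:.0%} more per-token compute")  # roughly the ~12%
                                                       # slowdown observed
```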

License

Apache 2.0 (same as base model)
