OpenMOSE/Qwen3.5-REAP-97B-A10B
Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3.5-122B-A10B.
1. Model Summary
- Base model: Qwen/Qwen3.5-122B-A10B (vision–language MoE LLM)
- Variant name: Qwen3.5-REAP-97B-A10B
- Architecture: Decoder-only Transformer (hybrid linear/full attention) + MoE MLP experts, with vision encoder + VL fusion as in Qwen3.5
- Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
- Expert sparsity: ~22% of MoE experts pruned globally (256 → 200 experts)
- Active parameters: "A10B" denotes roughly 10B active parameters per token (sparse MoE activation, 8 experts per token); total parameters are reduced to about 97B
- Modality: Text + Vision (VL support kept intact)
- License: Apache 2.0
- Author / Maintainer: OpenMOSE
- Year: 2025
This is an unofficial community variant of Qwen3.5, not affiliated with or endorsed by Alibaba or Cerebras Systems.
2. What Is REAP and What Did We Change?
REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:
- Router statistics (routing probabilities)
- Expert activation patterns on a calibration set
to identify under-used or redundant experts and prune them while preserving model quality as much as possible.
For this model:
- We applied REAP to Qwen3.5-122B-A10B across its MoE MLP blocks.
- ~22% of experts are pruned (256 → 200), based on router-weighted activation statistics.
- The routing mechanism itself is not conceptually changed; we only changed which experts remain.
- We extended the original REAP implementation to support the Qwen3.5 hybrid architecture (interleaved linear attention + full attention layers), so pruning can be applied without disrupting either attention pathway or VL functionality.
In short: same REAP algorithm, adapted to Qwen3.5's hybrid linear/full attention MoE architecture, leaving VL functionality available.
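The reference implementation lives in the Cerebras repository linked above. As a rough illustration of the core idea only (function names and shapes here are hypothetical, not the official code), router-weighted expert saliency can be sketched as: weight each expert's activation magnitude by the router probability assigned to it, average over calibration tokens, and prune the lowest-scoring experts.

```python
import numpy as np

def reap_scores(gate_probs, expert_out_norms):
    """Router-weighted saliency per expert (illustrative sketch, not the official REAP code).

    gate_probs:       (num_tokens, num_experts) routing probabilities
    expert_out_norms: (num_tokens, num_experts) L2 norm of each expert's
                      output per token (0 where the expert was not routed)
    """
    # Weight each expert's activation magnitude by how strongly the
    # router selected it, then average over the calibration tokens.
    return (gate_probs * expert_out_norms).mean(axis=0)

def experts_to_prune(scores, n_prune):
    # Prune the experts with the lowest router-weighted activation.
    return np.argsort(scores)[:n_prune]

# Toy example: 100 calibration tokens, 8 experts, drop the 2 weakest.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=100)
norms = rng.random((100, 8))
scores = reap_scores(probs, norms)
pruned = experts_to_prune(scores, 2)
```

For this release the same selection was made globally over 256 experts, removing the 56 lowest-scoring ones.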
3. Calibration Data
The REAP pruning statistics were computed using:
Calibration dataset: https://huggingface.co/datasets/OpenMOSE/reap-calib-mix
This dataset is mostly synthetic, generated by Qwen3-235B-Instruct on mixed prompts designed to cover:
- General instruction-following
- Reasoning and long-form text
The calibration set is not used for additional fine-tuning; it is used solely to measure router/expert activations to decide which experts to prune.
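Because calibration is measurement-only, it amounts to a forward pass over the dataset while recording router outputs. A minimal sketch of gathering such statistics with a forward hook (the toy router below stands in for the real gating module; the actual Qwen3.5 module path is an assumption and will differ):

```python
import torch

def make_router_hook(layer_idx, num_experts, stats):
    """Accumulate summed routing probability per expert for one MoE layer."""
    stats[layer_idx] = torch.zeros(num_experts)
    def hook(module, inputs, output):
        # Assume the router emits logits of shape (num_tokens, num_experts).
        probs = torch.softmax(output, dim=-1)
        stats[layer_idx] += probs.sum(dim=0).detach().cpu()
    return hook

# Toy gating module standing in for one MoE layer's router.
router = torch.nn.Linear(16, 8)
stats = {}
router.register_forward_hook(make_router_hook(0, 8, stats))

with torch.no_grad():
    router(torch.randn(100, 16))  # one calibration batch of 100 tokens

# stats[0] now holds per-expert routed probability mass; it sums to the
# token count because each softmax row sums to 1.
```

After the calibration pass, these per-expert statistics feed the pruning decision; no gradients or weight updates are involved.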
4. Why 97B-A10B? (Motivation & Hardware Footprint)
Qwen3.5-122B-A10B is among the most capable open-source models that fit on a 96 GB setup, but it still exceeds what a typical 48 GB GPU can handle. By pruning ~22% of experts:
- The model shrinks from ~122B total parameters to about 97B total parameters.
- Sparse MoE activation still keeps around 10B parameters active per token ("A10B"), so the effective compute profile matches the base model.
In practice, this makes the model feasible to deploy on a 48 GB GPU with modest CPU offload:
- Aggressive quantization (e.g., Q4) eliminates the need for offloading on 96 GB setups.
- 48 GB configurations can work with partial layer offloading to CPU.
The overarching goal is to bring the closest OSS approximation to a frontier model into local deployment, making Qwen3.5-122B-A10B accessible without requiring multi-GPU nodes.
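A back-of-envelope weight-memory calculation makes the footprint concrete (this is an estimate only; it ignores KV cache, activations, and runtime buffers, and assumes Q4_K_M averages roughly 4.5 bits per parameter):

```python
def weight_gib(total_params_b, bits_per_param):
    """Approximate weight memory in GiB (weights only; no KV cache or activations)."""
    return total_params_b * 1e9 * bits_per_param / 8 / 2**30

for name, params in [("Qwen3.5-122B", 122), ("REAP-97B", 97)]:
    bf16 = weight_gib(params, 16)
    q4 = weight_gib(params, 4.5)  # Q4_K_M averages ~4.5 bits/param
    print(f"{name}: ~{bf16:.0f} GiB bf16, ~{q4:.0f} GiB at ~4.5 bpw")
```

At ~4.5 bits per parameter the pruned model lands around 51 GiB, which fits a 96 GB setup without offload and a 48 GB GPU with a modest number of layers offloaded to CPU.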
5. Architecture Notes
Key architectural properties inherited from the base model and preserved after pruning:
| Property | Value |
|---|---|
| Hidden size | 3072 |
| Num layers | 48 |
| Attention type | Hybrid (3× linear + 1× full, repeating) |
| Full attention interval | every 4th layer |
| Num attention heads | 32 |
| Num KV heads | 2 |
| Head dim | 256 |
| MoE experts total | 200 (pruned from 256) |
| Experts per token | 8 |
| MoE intermediate size | 1024 |
| Max context length | 262,144 tokens |
| Vocab size | 248,320 |
The hybrid attention design (linear attention layers interleaved with full attention every 4 layers) is a distinctive feature of the Qwen3.5 family and is retained fully after pruning.
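The repeating pattern from the table can be written out explicitly (an illustrative sketch; layer indexing is 0-based here and the string labels are our own, not config keys):

```python
NUM_LAYERS = 48
FULL_ATTN_INTERVAL = 4  # every 4th layer uses full attention

# "linear, linear, linear, full" repeated 12 times across the 48 layers.
layer_types = [
    "full" if (i + 1) % FULL_ATTN_INTERVAL == 0 else "linear"
    for i in range(NUM_LAYERS)
]

full_layers = [i for i, t in enumerate(layer_types) if t == "full"]
# 12 full-attention layers at indices 3, 7, 11, ..., 47
```

REAP pruning touches only the MoE MLP experts, so both attention pathways in this pattern are left untouched.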
6. Intended Use
Primary intended uses
Research on:
- MoE pruning and compression (especially REAP) applied to hybrid attention architectures
- Scaling behavior of pruned MoE VL models under conservative pruning ratios (~22%)
- Trade-offs between expert sparsity and performance in linear-attention hybrids
Experimental deployment for:
- Vision–language assistants on constrained hardware
- Multimodal chatbots
- Document + image understanding
Suitable tasks (examples)
- Multimodal chat (image + text → text)
- Image captioning / description
- Visual question answering
- General instruction-following and long-form text generation
- Long-context reasoning (up to 262K tokens)
Out-of-scope / high-risk uses
This model should not be used without additional safeguards for:
- Medical, legal, or financial advice
- Safety-critical decision making
- Political persuasion or targeted disinformation
- Any scenario where incorrect or biased outputs can cause real-world harm
7. Limitations & Risks
This model inherits all the limitations of Qwen3.5-122B-A10B plus those introduced by pruning:
- Hallucinations: The model can generate plausible but incorrect facts.
- Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.
- Distribution shift from pruning:
  - Some long-tail behaviors or rare domain knowledge may degrade due to the removal of 56 experts.
  - Performance may be uneven across tasks or languages underrepresented in the calibration set.
- Multimodal edge cases:
  - Complex compositional visual reasoning or high-resolution images may not work reliably.
  - VL behavior is preserved but not re-tuned after pruning.
- Hybrid attention sensitivity:
  - Linear attention layers are more sensitive to changes in expert distribution than standard full attention; this is a known risk with the Qwen3.5 architecture.
Users should perform their own evaluation before relying on the model in any sensitive context.
8. How to Use
Note: Requires `transformers >= 4.57.0.dev0` for Qwen3.5 MoE support (`Qwen3_5MoeForConditionalGeneration`).
```python
import torch
from transformers import AutoProcessor
from transformers.models.qwen3_5_moe import Qwen3_5MoeForConditionalGeneration

model_id = "OpenMOSE/Qwen3.5-REAP-97B-A10B"

# Default: load the model on the available device(s)
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe the image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,  # values below 1.0 are more conservative
    top_p=0.9,        # or use top_k=50, etc.
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
For text-only usage, simply omit the image entry from `messages`.
llama.cpp / GGUF
For 48 GB GPU + CPU offload configurations, GGUF quantized versions (e.g., Q4_K_M) are recommended. Example offload configuration:
```shell
# --n-gpu-layers: tune based on your VRAM
./llama-cli -m qwen3.5-reap-97b-a10b.Q4_K_M.gguf \
  --n-gpu-layers 38 \
  --ctx-size 8192
```
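A rough way to pick a starting `--n-gpu-layers` value (a pure estimate: it assumes layers are equally sized and reserves a fixed VRAM margin for KV cache and runtime buffers, both of which vary in practice):

```python
def estimate_gpu_layers(vram_gib, model_gib, num_layers=48, overhead_gib=4.0):
    """Rough --n-gpu-layers starting point.

    Assumes all layers are equally sized and reserves overhead_gib of VRAM
    for KV cache and CUDA buffers. Tune the result empirically.
    """
    per_layer = model_gib / num_layers
    budget = max(vram_gib - overhead_gib, 0)
    return min(num_layers, int(budget / per_layer))

# e.g. a ~51 GiB Q4_K_M file on a 48 GiB GPU:
print(estimate_gpu_layers(48, 51))
```

Start from the estimate, then raise or lower the value until the model loads without out-of-memory errors.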
9. Evaluation (Status)
This release focuses on making the REAP-pruned model available for the community.
Quantitative benchmarks (e.g., MMLU, GSM8K, MMBench) are still work in progress.
Early qualitative checks show:
- VL behavior is preserved after pruning at this sparsity level.
- Latency and memory usage are meaningfully reduced compared to Qwen3.5-122B-A10B, enabling 48 GB + CPU-offload deployments.
- The conservative 22% pruning ratio appears to cause less degradation than more aggressive pruning schedules.
Community contributions with detailed benchmarks are very welcome.
10. Training & Distillation Details (High-Level)
Base model: Qwen/Qwen3.5-122B-A10B
Pruning method: REAP (Router-weighted Expert Activation Pruning)
Experts: 256 → 200 (22% pruned)
Calibration data: OpenMOSE/reap-calib-mix (mostly generated by Qwen3-235B-Instruct)
Post-processing:
- Router / gating structure retained
- Experts pruned according to REAP scoring
- No additional large-scale pretraining in this release
Future versions may include post-pruning fine-tuning or knowledge distillation from the full 122B model to recover further performance.
11. Community & Contribution
Let's grow this model together as a community.
You are encouraged to:
- Run benchmarks and publish results
- Contribute scripts for:
  - Further pruning experiments
  - Quantization (GGUF, AWQ, GPTQ)
  - Long-context or domain-specific fine-tuning
  - CPU/GPU offload configuration guides for various hardware setups
- Report issues or findings about failure modes, biases, or surprising behaviors
12. License
- Model & code (this repository): Apache License 2.0
- Downstream use must also respect the license and usage terms of the original Qwen3.5-122B-A10B model.
13. Acknowledgements
This architecture research and implementation was made possible with computing power and technical support from Recursal AI. We sincerely thank them for enabling this work.
- Qwen team for building the Qwen3.5 family of models.
- Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
- OpenMOSE community for experimentation, engineering, and calibration data generation.
2025 OpenMOSE