OpenMOSE/Qwen3.5-REAP-97B-A10B
Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3.5-122B-A10B.
1. Model Summary
- Base model: Qwen/Qwen3.5-122B-A10B (vision–language MoE LLM)
- Variant name: Qwen3.5-REAP-97B-A10B
- Architecture: Decoder-only Transformer (hybrid linear/full attention) + MoE MLP experts, with vision encoder + VL fusion as in Qwen3.5
- Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
- Expert sparsity: ~22% of MoE experts pruned globally (256 → 200 experts)
- Active parameters: "A10B" denotes roughly 10B active parameters per token (sparse MoE activation, 8 experts per token); total parameters are reduced to about 97B
- Modality: Text + Vision (VL support kept intact)
- License: Apache 2.0
- Author / Maintainer: OpenMOSE
- Year: 2025
This is an unofficial community variant of Qwen3.5, not affiliated with or endorsed by Alibaba or Cerebras Systems.
2. What Is REAP and What Did We Change?
REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:
- Router statistics (routing probabilities)
- Expert activation patterns on a calibration set
to identify under-used or redundant experts and prune them while preserving model quality as much as possible.
For this model:
- We applied REAP to Qwen3.5-122B-A10B across its MoE MLP blocks.
- ~22% of experts are pruned (256 → 200), based on router-weighted activation statistics.
- The routing mechanism itself is not conceptually changed; we only changed which experts remain.
- We extended the original REAP implementation to support the Qwen3.5 hybrid architecture (interleaved linear attention + full attention layers), so pruning can be applied without disrupting either attention pathway or VL functionality.
In short: same REAP algorithm, adapted to Qwen3.5's hybrid linear/full attention MoE architecture, leaving VL functionality available.
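The reference implementation lives in the Cerebras repository linked above. As a rough illustration of the core idea only (function names and shapes here are hypothetical, not the official code), router-weighted expert saliency can be sketched as: weight each expert's activation magnitude by the router probability assigned to it, average over calibration tokens, and prune the lowest-scoring experts.

```python
import numpy as np

def reap_scores(gate_probs, expert_out_norms):
    """Router-weighted saliency per expert (illustrative sketch, not the official REAP code).

    gate_probs:       (num_tokens, num_experts) routing probabilities
    expert_out_norms: (num_tokens, num_experts) L2 norm of each expert's
                      output per token (0 where the expert was not routed)
    """
    # Weight each expert's activation magnitude by how strongly the
    # router selected it, then average over the calibration tokens.
    return (gate_probs * expert_out_norms).mean(axis=0)

def experts_to_prune(scores, n_prune):
    # Prune the experts with the lowest router-weighted activation.
    return np.argsort(scores)[:n_prune]

# Toy example: 100 calibration tokens, 8 experts, drop the 2 weakest.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=100)
norms = rng.random((100, 8))
scores = reap_scores(probs, norms)
pruned = experts_to_prune(scores, 2)
```

For this release the same selection was made globally over 256 experts, removing the 56 lowest-scoring ones.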
3. Calibration Data
The REAP pruning statistics were computed using:
Calibration dataset: https://huggingface.co/datasets/OpenMOSE/reap-calib-mix
This dataset is mostly synthetic, generated by Qwen3-235B-Instruct on mixed prompts designed to cover:
- General instruction-following
- Reasoning and long-form text
The calibration set is not used for additional fine-tuning; it is used solely to measure router/expert activations to decide which experts to prune.
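Because calibration is measurement-only, it amounts to a forward pass over the dataset while recording router outputs. A minimal sketch of gathering such statistics with a forward hook (the toy router below stands in for the real gating module; the actual Qwen3.5 module path is an assumption and will differ):

```python
import torch

def make_router_hook(layer_idx, num_experts, stats):
    """Accumulate summed routing probability per expert for one MoE layer."""
    stats[layer_idx] = torch.zeros(num_experts)
    def hook(module, inputs, output):
        # Assume the router emits logits of shape (num_tokens, num_experts).
        probs = torch.softmax(output, dim=-1)
        stats[layer_idx] += probs.sum(dim=0).detach().cpu()
    return hook

# Toy gating module standing in for one MoE layer's router.
router = torch.nn.Linear(16, 8)
stats = {}
router.register_forward_hook(make_router_hook(0, 8, stats))

with torch.no_grad():
    router(torch.randn(100, 16))  # one calibration batch of 100 tokens

# stats[0] now holds per-expert routed probability mass; it sums to the
# token count because each softmax row sums to 1.
```

After the calibration pass, these per-expert statistics feed the pruning decision; no gradients or weight updates are involved.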
4. Why 97B-A10B? (Motivation & Hardware Footprint)
Qwen3.5-122B-A10B is among the most capable open-source models that fit on a 96 GB setup, but it still exceeds what a typical 48 GB GPU can handle. By pruning ~22% of experts:
- The model shrinks from ~122B total parameters to about 97B total parameters.
- Sparse MoE activation still keeps around 10B parameters active per token ("A10B"), so the effective compute profile matches the base model.
In practice, this makes the model feasible to deploy on a 48 GB GPU with modest CPU offload:
- Aggressive quantization (e.g., Q4) eliminates the need for offloading on 96 GB setups.
- 48 GB configurations can work with partial layer offloading to CPU.
The overarching goal is to bring the closest OSS approximation to a frontier model into local deployment, making Qwen3.5-122B-A10B accessible without requiring multi-GPU nodes.
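A back-of-envelope weight-memory calculation makes the footprint concrete (this is an estimate only; it ignores KV cache, activations, and runtime buffers, and assumes Q4_K_M averages roughly 4.5 bits per parameter):

```python
def weight_gib(total_params_b, bits_per_param):
    """Approximate weight memory in GiB (weights only; no KV cache or activations)."""
    return total_params_b * 1e9 * bits_per_param / 8 / 2**30

for name, params in [("Qwen3.5-122B", 122), ("REAP-97B", 97)]:
    bf16 = weight_gib(params, 16)
    q4 = weight_gib(params, 4.5)  # Q4_K_M averages ~4.5 bits/param
    print(f"{name}: ~{bf16:.0f} GiB bf16, ~{q4:.0f} GiB at ~4.5 bpw")
```

At ~4.5 bits per parameter the pruned model lands around 51 GiB, which fits a 96 GB setup without offload and a 48 GB GPU with a modest number of layers offloaded to CPU.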
5. Architecture Notes
Key architectural properties inherited from the base model and preserved after pruning:
| Property | Value |
|---|---|
| Hidden size | 3072 |
| Num layers | 48 |
| Attention type | Hybrid (3× linear + 1× full, repeating) |
| Full attention interval | every 4th layer |
| Num attention heads | 32 |
| Num KV heads | 2 |
| Head dim | 256 |
| MoE experts total | 200 (pruned from 256) |
| Experts per token | 8 |
| MoE intermediate size | 1024 |
| Max context length | 262,144 tokens |
| Vocab size | 248,320 |
The hybrid attention design (linear attention layers interleaved with full attention every 4 layers) is a distinctive feature of the Qwen3.5 family and is retained fully after pruning.
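The repeating pattern from the table can be written out explicitly (an illustrative sketch; layer indexing is 0-based here and the string labels are our own, not config keys):

```python
NUM_LAYERS = 48
FULL_ATTN_INTERVAL = 4  # every 4th layer uses full attention

# "linear, linear, linear, full" repeated 12 times across the 48 layers.
layer_types = [
    "full" if (i + 1) % FULL_ATTN_INTERVAL == 0 else "linear"
    for i in range(NUM_LAYERS)
]

full_layers = [i for i, t in enumerate(layer_types) if t == "full"]
# 12 full-attention layers at indices 3, 7, 11, ..., 47
```

REAP pruning touches only the MoE MLP experts, so both attention pathways in this pattern are left untouched.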
6. Intended Use
Primary intended uses
Research on:
- MoE pruning and compression (especially REAP) applied to hybrid attention architectures
- Scaling behavior of pruned MoE VL models under conservative pruning ratios (~22%)
- Trade-offs between expert sparsity and performance in linear-attention hybrids
Experimental deployment for:
- Vision–language assistants on constrained hardware
- Multimodal chatbots
- Document + image understanding
Suitable tasks (examples)
- Multimodal chat (image + text → text)
- Image captioning / description
- Visual question answering
- General instruction-following and long-form text generation
- Long-context reasoning (up to 262K tokens)
Out-of-scope / high-risk uses
This model should not be used without additional safeguards for:
- Medical, legal, or financial advice
- Safety-critical decision making
- Political persuasion or targeted disinformation
- Any scenario where incorrect or biased outputs can cause real-world harm
7. Limitations & Risks
This model inherits all the limitations of Qwen3.5-122B-A10B plus those introduced by pruning:
- Hallucinations: The model can generate plausible but incorrect facts.
- Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.
- Distribution shift from pruning:
  - Some long-tail behaviors or rare domain knowledge may degrade due to the removal of 56 experts.
  - Performance may be uneven across tasks or languages underrepresented in the calibration set.
- Multimodal edge cases:
  - Complex compositional visual reasoning or high-resolution images may not work reliably.
  - VL behavior is preserved but not re-tuned after pruning.
- Hybrid attention sensitivity:
  - Linear attention layers are more sensitive to changes in expert distribution than standard full attention; this is a known risk with the Qwen3.5 architecture.
Users should perform their own evaluation before relying on the model in any sensitive context.
8. How to Use
Note: Requires `transformers >= 4.57.0.dev0` for Qwen3.5 MoE support (`Qwen3_5MoeForConditionalGeneration`).
```python
import torch
from transformers import AutoProcessor
from transformers.models.qwen3_5_moe import Qwen3_5MoeForConditionalGeneration

model_id = "OpenMOSE/Qwen3.5-REAP-97B-A10B"

# Default: load the model on the available device(s)
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe the image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,  # values below 1.0 are more conservative
    top_p=0.9,        # or use top_k=50, etc.
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
For text-only usage, simply omit the image entry from `messages`.
llama.cpp / GGUF
For 48 GB GPU + CPU offload configurations, GGUF quantized versions (e.g., Q4_K_M) are recommended. Example offload configuration:
```shell
# --n-gpu-layers: tune based on your VRAM
./llama-cli -m qwen3.5-reap-97b-a10b.Q4_K_M.gguf \
  --n-gpu-layers 38 \
  --ctx-size 8192
```
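A rough way to pick a starting `--n-gpu-layers` value (a pure estimate: it assumes layers are equally sized and reserves a fixed VRAM margin for KV cache and runtime buffers, both of which vary in practice):

```python
def estimate_gpu_layers(vram_gib, model_gib, num_layers=48, overhead_gib=4.0):
    """Rough --n-gpu-layers starting point.

    Assumes all layers are equally sized and reserves overhead_gib of VRAM
    for KV cache and CUDA buffers. Tune the result empirically.
    """
    per_layer = model_gib / num_layers
    budget = max(vram_gib - overhead_gib, 0)
    return min(num_layers, int(budget / per_layer))

# e.g. a ~51 GiB Q4_K_M file on a 48 GiB GPU:
print(estimate_gpu_layers(48, 51))
```

Start from the estimate, then raise or lower the value until the model loads without out-of-memory errors.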
9. Evaluation (Status)
This release focuses on making the REAP-pruned model available for the community.
Quantitative benchmarks (e.g., MMLU, GSM8K, MMBench) are still work in progress.
Early qualitative checks show:
- VL behavior is preserved after pruning at this sparsity level.
- Latency and memory usage are meaningfully reduced compared to Qwen3.5-122B-A10B, enabling 48 GB + CPU-offload deployments.
- The conservative 22% pruning ratio appears to cause less degradation than more aggressive pruning schedules.
Community contributions with detailed benchmarks are very welcome.
10. Training & Distillation Details (High-Level)
Base model: Qwen/Qwen3.5-122B-A10B
Pruning method: REAP (Router-weighted Expert Activation Pruning)
Experts: 256 → 200 (22% pruned)
Calibration data: OpenMOSE/reap-calib-mix (mostly generated by Qwen3-235B-Instruct)
Post-processing:
- Router / gating structure retained
- Experts pruned according to REAP scoring
- No additional large-scale pretraining in this release
Future versions may include post-pruning fine-tuning or knowledge distillation from the full 122B model to recover further performance.
11. Community & Contribution
Let's grow this model together as a community.
You are encouraged to:
- Run benchmarks and publish results
- Contribute scripts for:
  - Further pruning experiments
  - Quantization (GGUF, AWQ, GPTQ)
  - Long-context or domain-specific fine-tuning
  - CPU/GPU offload configuration guides for various hardware setups
- Report issues or findings about failure modes, biases, or surprising behaviors
12. License
- Model & code (this repository): Apache License 2.0
- Downstream use must also respect the license and usage terms of the original Qwen3.5-122B-A10B model.
13. Acknowledgements
This architecture research and implementation was made possible with computing power and technical support from Recursal AI. We sincerely thank them for enabling this work.
- Qwen team for building the Qwen3.5 family of models.
- Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
- OpenMOSE community for experimentation, engineering, and calibration data generation.
2025 OpenMOSE