OpenMOSE/Qwen3.5-REAP-97B-A10B

Vision–Language MoE model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen3.5-122B-A10B.


1. Model Summary

  • Base model: Qwen/Qwen3.5-122B-A10B (vision–language MoE LLM)
  • Variant name: Qwen3.5-REAP-97B-A10B
  • Architecture: Decoder-only Transformer (hybrid linear/full attention) + MoE MLP experts, with vision encoder + VL fusion as in Qwen3.5
  • Pruning method: REAP (Router-weighted Expert Activation Pruning) by Cerebras Research https://github.com/CerebrasResearch/reap
  • Expert sparsity: ~22% of MoE experts pruned globally (256 → 200 experts)
  • Active parameters: "A10B" indicates ~10B active parameters per token (sparse MoE activation, 8 experts per token); total parameters are reduced to about 97B
  • Modality: Text + Vision (VL support kept intact)
  • License: Apache 2.0
  • Author / Maintainer: OpenMOSE
  • Year: 2025

This is an unofficial community variant of Qwen3.5, not affiliated with or endorsed by Alibaba or Cerebras Systems.


2. What Is REAP and What Did We Change?

REAP (Router-weighted Expert Activation Pruning) is a pruning method for MoE models that uses:

  • Router statistics (routing probabilities)
  • Expert activation patterns on a calibration set

to identify under-used or redundant experts and prune them while preserving model quality as much as possible.

For this model:

  • We applied REAP to Qwen3.5-122B-A10B across its MoE MLP blocks.
  • ~22% of experts are pruned (256 → 200), based on router-weighted activation statistics.
  • The routing mechanism itself is not conceptually changed; we only changed which experts remain.
  • We extended the original REAP implementation to support the Qwen3.5 hybrid architecture (interleaved linear attention + full attention layers), so pruning can be applied without disrupting either attention pathway or VL functionality.

In short: same REAP algorithm, adapted to Qwen3.5's hybrid linear/full attention MoE architecture, leaving VL functionality available.
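The scoring step can be sketched roughly as follows. This is an illustrative simplification based on the description above, not the Cerebras reference implementation; the saliency formula and function names are assumptions.

```python
import numpy as np

def reap_expert_saliency(gate_probs, expert_output_norms):
    """Router-weighted saliency per expert (illustrative).

    gate_probs: (tokens, experts) routing probabilities collected on the
    calibration set.
    expert_output_norms: (tokens, experts) magnitudes of each expert's
    output (zero where the expert was not activated for that token).
    """
    # Weight each expert's activation magnitude by how strongly the
    # router selected it, then average over all calibration tokens.
    return (gate_probs * expert_output_norms).mean(axis=0)

def select_pruned_experts(saliency, n_prune):
    # Prune the experts with the lowest router-weighted saliency.
    return np.argsort(saliency)[:n_prune]

# Synthetic example at this model's scale: 256 experts, prune 56 -> 200 kept.
rng = np.random.default_rng(0)
probs = rng.random((1024, 256))
probs /= probs.sum(axis=1, keepdims=True)   # normalize router probabilities
norms = rng.random((1024, 256))
sal = reap_expert_saliency(probs, norms)
pruned = select_pruned_experts(sal, 56)
print(len(pruned))  # 56
```

The routing weights themselves are untouched; only the lowest-scoring experts are dropped, which is why the gating mechanism stays conceptually unchanged.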


3. Calibration Data

The REAP pruning statistics were computed using the OpenMOSE/reap-calib-mix calibration set (mostly generated by Qwen3-235B-Instruct; see Section 10).

The calibration set is not used for additional fine-tuning; it is used solely to measure router/expert activations to decide which experts to prune.


4. Why 97B-A10B? (Motivation & Hardware Footprint)

Qwen3.5-122B-A10B is among the most frontier-capable open-source models that fit within 96 GB of memory, but it still exceeds what a typical 48 GB GPU setup can handle. By pruning ~22% of the experts:

  • The model shrinks from ~122B total parameters to about 97B total parameters.

  • Sparse MoE activation keeps around 10B parameters active per token ("A10B"), same effective compute profile as the base model.

  • In practice, this makes the model feasible to deploy on constrained hardware:

    • With aggressive quantization (e.g., Q4), 96 GB setups need no offloading at all.
    • 48 GB configurations can work with partial layer offloading to CPU.

The overarching goal is to bring the closest OSS approximation to a frontier model into local deployment, making Qwen3.5-122B-A10B accessible without requiring multi-GPU nodes.
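As a rough back-of-the-envelope check (weights only, ignoring KV cache and activation memory; the ~0.56 bytes/param figure for a Q4_K_M-style quant is an assumption):

```python
def weight_footprint_gb(total_params_b, bytes_per_param):
    # Parameters given in billions; 1 GiB = 1024**3 bytes.
    return total_params_b * 1e9 * bytes_per_param / 1024**3

for params in (122, 97):
    bf16 = weight_footprint_gb(params, 2.0)    # BF16: 2 bytes/param
    q4 = weight_footprint_gb(params, 0.56)     # ~4.5 bits/param, assumed
    print(f"{params}B: BF16 ~{bf16:.0f} GB, Q4_K_M ~{q4:.0f} GB")
```

Under these assumptions the Q4 weights of the 97B model land just above 48 GB, which matches the guidance above: full residence on 96 GB setups, partial CPU offload on 48 GB ones.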


5. Architecture Notes

Key architectural properties inherited from the base model and preserved after pruning:

| Property | Value |
|---|---|
| Hidden size | 3072 |
| Num layers | 48 |
| Attention type | Hybrid (3× linear + 1× full, repeating) |
| Full attention interval | every 4th layer |
| Num attention heads | 32 |
| Num KV heads | 2 |
| Head dim | 256 |
| MoE experts total | 200 (pruned from 256) |
| Experts per token | 8 |
| MoE intermediate size | 1024 |
| Max context length | 262,144 tokens |
| Vocab size | 248,320 |

The hybrid attention design (linear attention layers interleaved with full attention every 4 layers) is a distinctive feature of the Qwen3.5 family and is retained fully after pruning.
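The interleaving described above can be enumerated directly (a sketch; the exact layer indexing convention is an assumption):

```python
def attention_layer_types(num_layers=48, full_every=4):
    # Every 4th layer uses full attention; the rest use linear attention,
    # giving a repeating [linear, linear, linear, full] pattern.
    return ["full" if (i + 1) % full_every == 0 else "linear"
            for i in range(num_layers)]

layers = attention_layer_types()
print(layers[:4])            # ['linear', 'linear', 'linear', 'full']
print(layers.count("full"))  # 12 full-attention layers out of 48
```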


6. Intended Use

Primary intended uses

  • Research on:

    • MoE pruning and compression (especially REAP) applied to hybrid attention architectures
    • Scaling behavior of pruned MoE VL models under conservative pruning ratios (~22%)
    • Trade-offs between expert sparsity and performance in linear-attention hybrids
  • Experimental deployment for:

    • Vision–language assistants on constrained hardware
    • Multimodal chatbots
    • Document + image understanding

Suitable tasks (examples)

  • Multimodal chat (image + text → text)
  • Image captioning / description
  • Visual question answering
  • General instruction-following and long-form text generation
  • Long-context reasoning (up to 262K tokens)

Out-of-scope / high-risk uses

This model should not be used without additional safeguards for:

  • Medical, legal, or financial advice
  • Safety-critical decision making
  • Political persuasion or targeted disinformation
  • Any scenario where incorrect or biased outputs can cause real-world harm

7. Limitations & Risks

This model inherits all the limitations of Qwen3.5-122B-A10B plus those introduced by pruning:

  • Hallucinations: The model can generate plausible but incorrect facts.

  • Bias & toxicity: Biases from the original training data and synthetic calibration data remain and may be amplified.

  • Distribution shift from pruning:

    • Some long-tail behaviors or rare domain knowledge may degrade due to removal of 56 experts.
    • Performance may be uneven across tasks or languages underrepresented in the calibration set.
  • Multimodal edge cases:

    • Complex compositional visual reasoning or high-resolution images may not work reliably.
    • VL behavior is preserved but not re-tuned after pruning.
  • Hybrid attention sensitivity:

    • Linear attention layers are more sensitive to expert distribution changes than standard full attention; this is a known risk with the Qwen3.5 architecture.

Users should perform their own evaluation before relying on the model in any sensitive context.


8. How to Use

Note: Requires transformers >= 4.57.0.dev0 for Qwen3.5 MoE support (Qwen3_5MoeForConditionalGeneration).

import torch
from transformers import AutoProcessor
from transformers.models.qwen3_5_moe import Qwen3_5MoeForConditionalGeneration

model_id = "OpenMOSE/Qwen3.5-REAP-97B-A10B"


# Load the model on the available device(s)
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto"
)


processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe the image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,  # values below 1.0 make sampling more conservative
    top_p=0.9,        # alternatively, top_k=50 or similar
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

For text-only usage, omit the "image" entries from the message content.
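For example, a small helper can build either kind of message for apply_chat_template (a sketch; the user_message helper is hypothetical, not part of the transformers API):

```python
def user_message(text, image_url=None):
    # Build a chat message in the content-list format used above;
    # include an "image" entry only when an image is supplied.
    content = []
    if image_url is not None:
        content.append({"type": "image", "image": image_url})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

text_only = [user_message("Summarize MoE pruning in one paragraph.")]
multimodal = [user_message("Describe the image.", "https://example.com/cat.jpeg")]
print(len(text_only[0]["content"]))   # 1 (text only)
print(len(multimodal[0]["content"]))  # 2 (image + text)
```

Either list can then be passed to processor.apply_chat_template exactly as in the snippet above.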

llama.cpp / GGUF

For 48 GB GPU + CPU offload configurations, GGUF quantized versions (e.g., Q4_K_M) are recommended. Example offload configuration:

# --n-gpu-layers: tune based on your VRAM (a trailing comment after "\"
# would break the line continuation, so it lives up here instead)
./llama-cli -m qwen3.5-reap-97b-a10b.Q4_K_M.gguf \
  --n-gpu-layers 38 \
  --ctx-size 8192
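One way to pick --n-gpu-layers is a simple per-layer estimate. This is a rough heuristic, not an exact calculator; the file size, overhead reserve, and even-split assumption are all approximations.

```python
def estimate_gpu_layers(model_size_gb, num_layers, vram_gb, overhead_gb=4.0):
    # Assume weights are spread roughly evenly across layers, and reserve
    # some VRAM for the KV cache and compute buffers.
    per_layer_gb = model_size_gb / num_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(num_layers, int(usable / per_layer_gb))

# e.g. a ~51 GB Q4_K_M file, 48 layers, on a 48 GB GPU
print(estimate_gpu_layers(51.0, 48, 48.0))  # 41
```

In practice, start near the estimate and step the value down if llama.cpp reports an out-of-memory error at your target context size.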

9. Evaluation (Status)

  • This release focuses on making the REAP-pruned model available for the community.

  • Quantitative benchmarks (e.g., MMLU, GSM8K, MMBench) are still work in progress.

  • Early qualitative checks show:

    • VL behavior is preserved after pruning at this sparsity level.
    • Latency and memory usage are meaningfully reduced compared to Qwen3.5-122B-A10B, enabling 48 GB + CPU-offload deployments.
    • The conservative 22% pruning ratio appears to cause less degradation than more aggressive pruning schedules.

Community contributions with detailed benchmarks are very welcome.


10. Training & Distillation Details (High-Level)

  • Base model: Qwen/Qwen3.5-122B-A10B

  • Pruning method: REAP (Router-weighted Expert Activation Pruning)

  • Experts: 256 → 200 (22% pruned)

  • Calibration data: OpenMOSE/reap-calib-mix (mostly generated by Qwen3-235B-Instruct)

  • Post-processing:

    • Router / gating structure retained
    • Experts pruned according to REAP scoring
    • No additional large-scale pretraining in this release

Future versions may include post-pruning fine-tuning or knowledge distillation from the full 122B model to recover further performance.


11. Community & Contribution

Let's grow this model together as a community.

You are encouraged to:

  • Run benchmarks and publish results

  • Contribute scripts for:

    • Further pruning experiments
    • Quantization (GGUF, AWQ, GPTQ)
    • Long-context or domain-specific fine-tuning
    • CPU/GPU offload configuration guides for various hardware setups
  • Report issues or findings about failure modes, biases, or surprising behaviors


12. License

  • Model & code (this repository): Apache License 2.0
  • The original Qwen3.5-122B-A10B model and any downstream use must also respect their respective licenses and usage terms.

13. Acknowledgements

This architecture research and implementation was made possible with computing power and technical support from Recursal AI (https://featherless.ai/). We sincerely thank them for enabling this work.

  • Qwen team for building the Qwen3.5 family of models.
  • Cerebras Research for the REAP method and reference implementation: https://github.com/CerebrasResearch/reap
  • OpenMOSE community for experimentation, engineering, and calibration data generation.

2025 OpenMOSE
