OpenMOSE/RWKV-GLM-4.7-Flash-exp

PrimeRWKV: Not a Hybrid. A Fusion.


Overview

RWKV-GLM-4.7-Flash-exp is an alpha-stage experimental model that converts GLM-4.7-Flash into a fully linear-attention-dominant architecture using the RADLADS distillation methodology. Every single layer runs RWKV-7; there are no standalone self-attention layers.

This model introduces PrimeRWKV, a new architectural paradigm consisting of two layer types:

Layer Type      Count      Description
PrimeRWKV       11 layers  RWKV-7 + TICA (Tiny Infused Causal Attention)
EfficientRWKV   36 layers  Pure RWKV-7 linear attention

All 47 layers are RWKV-7 layers. No exceptions.

TICA: Tiny Infused Causal Attention

Unlike conventional hybrid architectures that alternate between linear and full attention layers (e.g., Jamba, Griffin, Zamba), TICA (Tiny Infused Causal Attention) takes a fundamentally different approach: attention is infused directly within an RWKV-7 block as a lightweight gated auxiliary path, rather than occupying its own layer.

Key properties:

  • NoPE (No Positional Encoding): The RWKV-7 decay mechanism handles positional information; TICA relies solely on the causal mask.
  • Few-Head GQA: 4 query heads / 2 KV heads with head dimension 128, keeping the path extremely compact.
  • QK-Norm: RMSNorm on Q and K for training stability without positional encodings.
  • LoRA-Gated Output: A learned sigmoid gate controls how much TICA contributes, allowing the model to modulate attention strength per-token.
  • Zero-Initialized Output Projection: TICA starts with zero contribution and gradually learns its role during distillation, preserving the RWKV backbone's learned representations.
  • SDPA-Compatible: Directly uses F.scaled_dot_product_attention, enabling FlashAttention dispatch with no custom kernels.
  • Independent Path Addition (Pattern B): TICA output is added after the RWKV output projection, keeping gradient flows independent.

The result: TICA supplements RWKV-7's linear attention where full-context retrieval matters, without replacing it. Pure RWKV-7 handles the bulk of computation; TICA provides surgical precision where needed.
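The properties above can be sketched in PyTorch. This is a hypothetical minimal reconstruction based only on the description in this card: the module and parameter names are illustrative, the gate is written as a plain linear projection rather than the LoRA-gated variant the model actually uses, and RMS normalization is inlined without a learned scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _rms(t, eps=1e-6):
    # Parameter-free RMS normalization (the real model likely uses a
    # learned-scale RMSNorm on Q and K).
    return t * torch.rsqrt(t.pow(2).mean(-1, keepdim=True) + eps)

class TICA(nn.Module):
    """Hypothetical sketch of Tiny Infused Causal Attention:
    4 Q heads / 2 KV heads, head dim 128, NoPE, QK-Norm,
    gated and zero-initialized output projection."""
    def __init__(self, hidden_size=2048, n_q_heads=4, n_kv_heads=2, head_dim=128):
        super().__init__()
        self.n_q, self.n_kv, self.d = n_q_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(hidden_size, n_q_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, n_kv_heads * head_dim, bias=False)
        self.gate = nn.Linear(hidden_size, n_q_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * head_dim, hidden_size, bias=False)
        nn.init.zeros_(self.o_proj.weight)  # zero contribution at init

    def forward(self, x):
        B, T, _ = x.shape
        q = _rms(self.q_proj(x).view(B, T, self.n_q, self.d)).transpose(1, 2)
        k = _rms(self.k_proj(x).view(B, T, self.n_kv, self.d)).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv, self.d).transpose(1, 2)
        # GQA: expand 2 KV heads to 4 Q heads; no positional encoding (NoPE),
        # only the causal mask inside SDPA.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        o = o.transpose(1, 2).reshape(B, T, self.n_q * self.d)
        return self.o_proj(torch.sigmoid(self.gate(x)) * o)
```

Because `o_proj` is zero-initialized, the block contributes exactly nothing at initialization and its output would be added after the RWKV output projection (Pattern B), matching the behavior described above.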

TICA Layer Placement

TICA is applied to 11 of 47 layers, concentrated in the model's mid-to-late layers where precise token recall is most critical:

Layers with TICA: [21, 23, 25, 27, 28, 30, 33, 37, 38, 41, 43]

Architecture Details

Total Parameters:     30B (MoE)
Hidden Size:          2048
Num Layers:           47
Attention Heads:      40
KV Heads:             40
Head Dim:             128
Intermediate Size:    10240 (dense) / 1536 (MoE per expert)
MoE Experts:          64 routed + 1 shared, top-4 selection
Vocab Size:           154,880
Max Context:          202,752 tokens
Dtype:                bfloat16

TICA Configuration

TICA Heads:           4 query / 2 KV (GQA)
TICA Head Dim:        128
TICA Total Q Dim:     512
TICA Total KV Dim:    256
Params per TICA:      ~3.6M
Total TICA Overhead:  ~40M (11 layers × 3.6M)
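A back-of-envelope check of the per-layer figure, assuming the projection shapes implied by the configuration above (hidden 2048, Q dim 512, KV dim 256 each for K and V). The gate and norm parameterization is not published, so this counts only the four main projections:

```python
hidden, q_dim, kv_dim = 2048, 512, 256
q_proj = hidden * q_dim    # 1,048,576
k_proj = hidden * kv_dim   #   524,288
v_proj = hidden * kv_dim   #   524,288
o_proj = q_dim * hidden    # 1,048,576
per_layer = q_proj + k_proj + v_proj + o_proj  # ~3.15M before gate/norm params
total = 11 * per_layer                         # ~34.6M across the 11 TICA layers
print(f"{per_layer/1e6:.2f}M per layer, {total/1e6:.1f}M total")
```

The remaining ~0.4M per layer would come from the gate and normalization parameters, bringing the totals close to the ~3.6M / ~40M figures quoted above.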

RWKV-7 LoRA Ranks

Decay (w):            512
ICLR / Alpha (a):     256
Gate (g):             384

MoE Configuration

Architecture:         RWKV07IMoEForCausalLM
Routing Method:       noaux_tc
Routed Experts:       64
Shared Experts:       1
Experts per Token:    4
Routing Scale:        1.8
Dense Layers:         1 (first layer)
Next-Token Predict:   1 layer
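The routing configuration above (sigmoid-style `noaux_tc` scoring, top-4 of 64 experts, routing scale 1.8) can be sketched as follows. This is an illustrative simplification: it omits the aux-loss-free per-expert bias adjustment that `noaux_tc` routing uses for load balancing, and the function name is hypothetical.

```python
import torch

def route_topk(hidden, gate_weight, k=4, scale=1.8):
    """Hypothetical sketch of top-k expert selection with sigmoid scoring.
    hidden: [tokens, hidden_size]; gate_weight: [num_experts, hidden_size]."""
    scores = torch.sigmoid(hidden @ gate_weight.T)          # [tokens, 64]
    top_scores, top_idx = scores.topk(k, dim=-1)            # keep best 4 experts
    # Normalize the selected scores, then apply the routing scale factor.
    weights = top_scores / top_scores.sum(-1, keepdim=True) * scale
    return weights, top_idx
```

Each token's output would then be the scale-weighted sum of its four routed experts plus the always-active shared expert.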

How It Differs from Conventional Hybrids

                    Conventional Hybrid                       PrimeRWKV (This Model)
Layer composition   Alternating attention / linear layers     All layers are RWKV-7
Attention role      Full independent layer                    Tiny auxiliary path inside RWKV block
KV cache growth     Full-dimension KV for attention layers    256-dim KV only in 11/47 layers
Identity at init    Attention layers active from start        TICA zero-init, learns contribution
Design philosophy   Two architectures coexisting              One architecture with infused capability

This is not hybridization. This is fusion.

Conventional hybrids treat attention and linear layers as separate entities that take turns. PrimeRWKV with TICA dissolves the boundary: attention capability is infused into the linear attention backbone itself, creating something that is neither purely linear nor a traditional hybrid, but a unified architecture where RWKV-7 and causal attention operate as a single integrated mechanism.
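The KV-cache advantage can be made concrete with a back-of-envelope estimate, assuming bf16 storage (2 bytes per element) and the dimensions quoted in this card. These are illustrative figures, not measurements:

```python
BYTES_BF16 = 2

# TICA: 11 layers cache a 256-dim K and a 256-dim V per token.
tica_layers, tica_kv_dim = 11, 256
tica_per_token = tica_layers * 2 * tica_kv_dim * BYTES_BF16  # ~11 KB/token

# A hypothetical full-attention layer at this model's width:
# 40 KV heads x 128 head dim = 5120-dim K and V per token.
full_kv_dim = 40 * 128
full_per_layer = 2 * full_kv_dim * BYTES_BF16                # ~20 KB/token

print(tica_per_token, full_per_layer)
```

Under these assumptions, a single conventional full-attention layer would cache more per token than all 11 TICA paths combined.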

Base Model & Distillation

  • Base Model: GLM-4.7-Flash
  • Distillation Method: Based on RADLADS (by SmerkyG), a multi-stage distillation pipeline that converts transformer attention layers into RWKV-7 linear attention while preserving model quality.
  • Stage 1: Hidden state MSE alignment (layer-by-layer representation matching)
  • Stage 2: KL divergence logit distillation (output distribution matching)
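The two stages above correspond to standard distillation objectives, which can be sketched as follows. The function names and temperature parameter are illustrative, not taken from the actual RADLADS training code:

```python
import torch
import torch.nn.functional as F

def stage1_hidden_mse(student_h, teacher_h):
    # Stage 1: layer-by-layer hidden state alignment via MSE.
    return F.mse_loss(student_h, teacher_h)

def stage2_kl(student_logits, teacher_logits, T=1.0):
    # Stage 2: match the teacher's output distribution via KL divergence.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (T * T)
```

Stage 1 would drive each RWKV-7 layer's representations toward those of the corresponding GLM-4.7-Flash attention layer; Stage 2 then fine-tunes the full converted model against the teacher's logits.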

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "OpenMOSE/RWKV-GLM-4.7-Flash-exp",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "OpenMOSE/RWKV-GLM-4.7-Flash-exp",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain the concept of linear attention."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Limitations

  • Alpha release: not production-ready. Expect rough edges.
  • Distillation quality varies across tasks; some capabilities of the original GLM-4.7-Flash may be degraded.
  • Long-context performance (>32K) has not been extensively validated.
  • Custom code (trust_remote_code=True) is required.

Acknowledgments

This work would not have been possible without the generous support and contributions of the following:

featherless.ai - for providing the computing resources that made this research possible. Their support has been invaluable, and we are deeply grateful.

SmerkyG - author of the RADLADS distillation methodology and a key technical advisor throughout this project. His guidance on distillation strategy, training stability, and architectural decisions has been instrumental.

The RWKV Community - for ongoing discussions, feedback, and the shared vision of making efficient architectures practical.

License

This model is released under the Apache 2.0 License, following the base model's licensing terms.
