OpenMOSE/RWKV-GLM-4.7-Flash-exp
PrimeRWKV: Not a Hybrid, a Fusion.
Overview
RWKV-GLM-4.7-Flash-exp is an alpha-stage experimental model that converts GLM-4.7-Flash into a fully linear-attention-dominant architecture using the RADLADS distillation methodology. Every single layer runs RWKV-7; there are no standalone self-attention layers.
This model introduces PrimeRWKV, a new architectural paradigm consisting of two layer types:
| Layer Type | Count | Description |
|---|---|---|
| PrimeRWKV | 11 layers | RWKV-7 + TICA (Tiny Infused Causal Attention) |
| EfficientRWKV | 36 layers | Pure RWKV-7 linear attention |
All 47 layers are RWKV-7 layers. No exceptions.
TICA: Tiny Infused Causal Attention
Unlike conventional hybrid architectures that alternate between linear and full attention layers (e.g., Jamba, Griffin, Zamba), TICA (Tiny Infused Causal Attention) takes a fundamentally different approach: attention is infused directly within an RWKV-7 block as a lightweight gated auxiliary path, rather than occupying its own layer.
Key properties:
- NoPE (No Positional Encoding): The RWKV-7 decay mechanism handles positional information; TICA relies solely on the causal mask.
- Few-Head GQA: 4 query heads / 2 KV heads with head dimension 128, extremely compact.
- QK-Norm: RMSNorm on Q and K for training stability without positional encodings.
- LoRA-Gated Output: A learned sigmoid gate controls how much TICA contributes, allowing the model to modulate attention strength per-token.
- Zero-Initialized Output Projection: TICA starts with zero contribution and gradually learns its role during distillation, preserving the RWKV backbone's learned representations.
- SDPA-Compatible: Directly uses `F.scaled_dot_product_attention`, enabling FlashAttention dispatch with no custom kernels.
- Independent Path Addition (Pattern B): TICA output is added after the RWKV output projection, keeping gradient flows independent.
The result: TICA supplements RWKV-7's linear attention where full-context retrieval matters, without replacing it. Pure RWKV-7 handles the bulk of computation; TICA provides surgical precision where needed.
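The properties above can be combined into a minimal sketch of a TICA path. This is illustrative only: the class name, the LoRA gate rank, and the parameter-free `rms_norm` are assumptions, not the model's actual code. The zero-initialized output projection means the module contributes exactly nothing at initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(t, eps=1e-6):
    # QK-Norm (parameter-free here; the real module likely has a learned scale)
    return t * torch.rsqrt(t.pow(2).mean(-1, keepdim=True) + eps)

class TICASketch(nn.Module):
    """Sketch of a Tiny Infused Causal Attention path inside an RWKV-7 block."""
    def __init__(self, hidden=2048, n_q=4, n_kv=2, head_dim=128, gate_rank=64):
        super().__init__()
        self.n_q, self.n_kv, self.hd = n_q, n_kv, head_dim
        self.q = nn.Linear(hidden, n_q * head_dim, bias=False)
        self.k = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.v = nn.Linear(hidden, n_kv * head_dim, bias=False)
        # LoRA-style gate: low-rank bottleneck, sigmoid output per token
        self.g1 = nn.Linear(hidden, gate_rank, bias=False)
        self.g2 = nn.Linear(gate_rank, hidden, bias=False)
        self.o = nn.Linear(n_q * head_dim, hidden, bias=False)
        nn.init.zeros_(self.o.weight)  # zero-init: no contribution at start

    def forward(self, x):
        B, T, _ = x.shape
        q = rms_norm(self.q(x).view(B, T, self.n_q, self.hd)).transpose(1, 2)
        k = rms_norm(self.k(x).view(B, T, self.n_kv, self.hd)).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        # GQA: expand 2 KV heads to match 4 query heads
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        # NoPE: only the causal mask carries position information
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        a = self.o(a.transpose(1, 2).reshape(B, T, -1))
        gate = torch.sigmoid(self.g2(self.g1(x)))
        return gate * a  # added after the RWKV output projection (Pattern B)
```

Because `o` starts at zero, the gated output is exactly zero at initialization, preserving the distilled RWKV backbone until TICA learns its role.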
TICA Layer Placement
TICA is applied to 11 of 47 layers, concentrated in the model's mid-to-late layers where precise token recall is most critical:
Layers with TICA: [21, 23, 25, 27, 28, 30, 33, 37, 38, 41, 43]
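The placement can be checked against the layer counts above (assuming 0-indexed layers, as the listed indices suggest):

```python
# Layers carrying the TICA auxiliary path, per the model card
TICA_LAYERS = {21, 23, 25, 27, 28, 30, 33, 37, 38, 41, 43}

schedule = ["PrimeRWKV" if i in TICA_LAYERS else "EfficientRWKV" for i in range(47)]
print(schedule.count("PrimeRWKV"), schedule.count("EfficientRWKV"))  # 11 36
```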
Architecture Details
Total Parameters: 30B (MoE)
Hidden Size: 2048
Num Layers: 47
Attention Heads: 40
KV Heads: 40
Head Dim: 128
Intermediate Size: 10240 (dense) / 1536 (MoE per expert)
MoE Experts: 64 routed + 1 shared, top-4 selection
Vocab Size: 154,880
Max Context: 202,752 tokens
Dtype: bfloat16
TICA Configuration
TICA Heads: 4 query / 2 KV (GQA)
TICA Head Dim: 128
TICA Total Q Dim: 512
TICA Total KV Dim: 256
Params per TICA: ~3.6M
Total TICA Overhead: ~40M (11 layers × 3.6M)
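The per-layer figure can be sanity-checked from the dimensions above; the split between the projections and the remaining gate/norm parameters is an assumption:

```python
hidden, q_dim, kv_dim = 2048, 512, 256

# Q, K, V, and output projections of one TICA path
proj_params = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden
print(f"{proj_params:,}")  # 3,145,728 -> ~3.1M; the remaining ~0.5M per layer
                           # presumably comes from the LoRA gate and QK norms

print(f"{11 * 3.6e6 / 1e6:.1f}M")  # 39.6M, i.e. the quoted ~40M total
```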
RWKV-7 LoRA Ranks
Decay (w): 512
ICLR / Alpha (a): 256
Gate (g): 384
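RWKV-7's data-dependent decay, in-context learning rate, and gate terms are typically produced by small low-rank bottleneck MLPs, which is what these ranks parameterize. A sketch of that pattern with the ranks listed above; the `tanh` nonlinearity and exact wiring are assumptions:

```python
import torch
import torch.nn as nn

class LowRankHead(nn.Module):
    """hidden -> rank -> hidden bottleneck producing a per-token parameter."""
    def __init__(self, hidden=2048, rank=512):
        super().__init__()
        self.down = nn.Linear(hidden, rank, bias=False)
        self.up = nn.Linear(rank, hidden, bias=False)

    def forward(self, x):
        return self.up(torch.tanh(self.down(x)))

# One head per RWKV-7 parameter, using the ranks from the model card
heads = {
    "decay": LowRankHead(rank=512),
    "a": LowRankHead(rank=256),
    "g": LowRankHead(rank=384),
}
```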
MoE Configuration
Architecture: RWKV07IMoEForCausalLM
Routing Method: noaux_tc
Routed Experts: 64
Shared Experts: 1
Experts per Token: 4
Routing Scale: 1.8
Dense Layers: 1 (first layer)
Next-Token Predict: 1 layer
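A sketch of the top-4 selection with the 1.8 routing scale. The sigmoid scoring follows DeepSeek-style `noaux_tc` routers, but the auxiliary-loss-free score correction is omitted here and the exact form is an assumption:

```python
import torch

def route(router_logits, n_top=4, scale=1.8):
    """Pick top-4 of 64 routed experts; the 1 shared expert is always active."""
    scores = torch.sigmoid(router_logits)             # per-expert affinity
    top_scores, top_idx = scores.topk(n_top, dim=-1)  # top-4 selection
    # normalize the selected weights, then apply the routing scale
    weights = top_scores / top_scores.sum(-1, keepdim=True) * scale
    return top_idx, weights
```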
How It Differs from Conventional Hybrids
| Aspect | Conventional Hybrid | PrimeRWKV (This Model) |
|---|---|---|
| Layer composition | Alternating Attention / Linear layers | All layers are RWKV-7 |
| Attention role | Full independent layer | Tiny auxiliary path inside RWKV block |
| KV cache growth | Full-dimension KV for attention layers | 256-dim KV only in 11/47 layers |
| Identity at init | Attention layers active from start | TICA zero-init, learns contribution |
| Design philosophy | Two architectures coexisting | One architecture with infused capability |
This is not hybridization. This is fusion.
Conventional hybrids treat attention and linear layers as separate entities that take turns. PrimeRWKV with TICA dissolves the boundary: attention capability is infused into the linear attention backbone itself, creating something that is neither purely linear nor a traditional hybrid, but a unified architecture where RWKV-7 and causal attention operate as a single integrated mechanism.
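The KV-cache row of the table can be made concrete. With bfloat16, TICA caches 256-dim K and V in only 11 layers; the comparison baseline below is a hypothetical model caching full-dimension KV (40 heads × 128) in all 47 layers, not any specific hybrid:

```python
BYTES = 2  # bfloat16

tica = 11 * 2 * 256 * BYTES       # 11 TICA layers, K + V, 256 dims each
full = 47 * 2 * 40 * 128 * BYTES  # hypothetical full-dim KV in every layer

print(tica, full, full // tica)   # 11264 962560 85 -> ~85x less cache per token
```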
Base Model & Distillation
- Base Model: GLM-4.7-Flash
- Distillation Method: Based on RADLADS (by SmerkyG), a multi-stage distillation pipeline that converts transformer attention layers into RWKV-7 linear attention while preserving model quality.
- Stage 1: Hidden state MSE alignment (layer-by-layer representation matching)
- Stage 2: KL divergence logit distillation (output distribution matching)
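The two stages correspond to two standard distillation losses; a sketch, with the temperature and KL direction as assumptions:

```python
import torch
import torch.nn.functional as F

def stage1_loss(student_h, teacher_h):
    """Stage 1: layer-by-layer hidden-state MSE alignment."""
    return F.mse_loss(student_h, teacher_h)

def stage2_loss(student_logits, teacher_logits, T=1.0):
    """Stage 2: KL divergence between teacher and student output distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)
```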
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "OpenMOSE/RWKV-GLM-4.7-Flash-exp",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "OpenMOSE/RWKV-GLM-4.7-Flash-exp",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain the concept of linear attention."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Limitations
- Alpha release: not production-ready. Expect rough edges.
- Distillation quality varies across tasks; some capabilities of the original GLM-4.7-Flash may be degraded.
- Long-context performance (>32K) has not been extensively validated.
- Custom code (`trust_remote_code=True`) is required.
Acknowledgments
This work would not have been possible without the generous support and contributions of the following:
featherless.ai: for providing the computing resources that made this research possible. Their support has been invaluable, and we are deeply grateful.
SmerkyG: author of the RADLADS distillation methodology and a key technical advisor throughout this project. His guidance on distillation strategy, training stability, and architectural decisions has been instrumental.
The RWKV Community: for ongoing discussions, feedback, and the shared vision of making efficient architectures practical.
License
This model is released under the Apache 2.0 License, following the base model's licensing terms.