OpenMOSE/RWKV-GLM-4.7-Flash-exp
PrimeRWKV: Not a Hybrid, a Fusion.
Overview
RWKV-GLM-4.7-Flash-exp is an alpha-stage experimental model that converts GLM-4.7-Flash into a fully linear-attention-dominant architecture using the RADLADS distillation methodology. Every single layer runs RWKV-7; there are no standalone self-attention layers.
This model introduces PrimeRWKV, a new architectural paradigm consisting of two layer types:
| Layer Type | Count | Description |
|---|---|---|
| PrimeRWKV | 11 layers | RWKV-7 + TICA (Tiny Infused Causal Attention) |
| EfficientRWKV | 36 layers | Pure RWKV-7 linear attention |
All 47 layers are RWKV-7 layers. No exceptions.
TICA: Tiny Infused Causal Attention
Unlike conventional hybrid architectures that alternate between linear and full attention layers (e.g., Jamba, Griffin, Zamba), TICA (Tiny Infused Causal Attention) takes a fundamentally different approach: attention is infused directly within an RWKV-7 block as a lightweight gated auxiliary path, rather than occupying its own layer.
Key properties:
- NoPE (No Positional Encoding): The RWKV-7 decay mechanism handles positional information; TICA relies solely on the causal mask.
- Few-Head GQA: 4 query heads / 2 KV heads with head dimension 128, extremely compact.
- QK-Norm: RMSNorm on Q and K for training stability without positional encodings.
- LoRA-Gated Output: A learned sigmoid gate controls how much TICA contributes, allowing the model to modulate attention strength per-token.
- Zero-Initialized Output Projection: TICA starts with zero contribution and gradually learns its role during distillation, preserving the RWKV backbone's learned representations.
- SDPA-Compatible: Directly uses `F.scaled_dot_product_attention`, enabling FlashAttention dispatch with no custom kernels.
- Independent Path Addition (Pattern B): TICA output is added after the RWKV output projection, keeping gradient flows independent.
The result: TICA supplements RWKV-7's linear attention where full-context retrieval matters, without replacing it. Pure RWKV-7 handles the bulk of computation; TICA provides surgical precision where needed.
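The properties above can be combined into a minimal sketch of a TICA path. This is illustrative only: the class name, the LoRA gate rank, and the parameter-free `rms_norm` are assumptions, not the model's actual code. The zero-initialized output projection means the module contributes exactly nothing at initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(t, eps=1e-6):
    # QK-Norm (parameter-free here; the real module likely has a learned scale)
    return t * torch.rsqrt(t.pow(2).mean(-1, keepdim=True) + eps)

class TICASketch(nn.Module):
    """Sketch of a Tiny Infused Causal Attention path inside an RWKV-7 block."""
    def __init__(self, hidden=2048, n_q=4, n_kv=2, head_dim=128, gate_rank=64):
        super().__init__()
        self.n_q, self.n_kv, self.hd = n_q, n_kv, head_dim
        self.q = nn.Linear(hidden, n_q * head_dim, bias=False)
        self.k = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.v = nn.Linear(hidden, n_kv * head_dim, bias=False)
        # LoRA-style gate: low-rank bottleneck, sigmoid output per token
        self.g1 = nn.Linear(hidden, gate_rank, bias=False)
        self.g2 = nn.Linear(gate_rank, hidden, bias=False)
        self.o = nn.Linear(n_q * head_dim, hidden, bias=False)
        nn.init.zeros_(self.o.weight)  # zero-init: no contribution at start

    def forward(self, x):
        B, T, _ = x.shape
        q = rms_norm(self.q(x).view(B, T, self.n_q, self.hd)).transpose(1, 2)
        k = rms_norm(self.k(x).view(B, T, self.n_kv, self.hd)).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        # GQA: expand 2 KV heads to match 4 query heads
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        # NoPE: only the causal mask carries position information
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        a = self.o(a.transpose(1, 2).reshape(B, T, -1))
        gate = torch.sigmoid(self.g2(self.g1(x)))
        return gate * a  # added after the RWKV output projection (Pattern B)
```

Because `o` starts at zero, the gated output is exactly zero at initialization, preserving the distilled RWKV backbone until TICA learns its role.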
TICA Layer Placement
TICA is applied to 11 of 47 layers, concentrated in the model's mid-to-late layers where precise token recall is most critical:
Layers with TICA: [21, 23, 25, 27, 28, 30, 33, 37, 38, 41, 43]
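The placement can be checked against the layer counts above (assuming 0-indexed layers, as the listed indices suggest):

```python
# Layers carrying the TICA auxiliary path, per the model card
TICA_LAYERS = {21, 23, 25, 27, 28, 30, 33, 37, 38, 41, 43}

schedule = ["PrimeRWKV" if i in TICA_LAYERS else "EfficientRWKV" for i in range(47)]
print(schedule.count("PrimeRWKV"), schedule.count("EfficientRWKV"))  # 11 36
```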
Architecture Details
Total Parameters: 30B (MoE)
Hidden Size: 2048
Num Layers: 47
Attention Heads: 40
KV Heads: 40
Head Dim: 128
Intermediate Size: 10240 (dense) / 1536 (MoE per expert)
MoE Experts: 64 routed + 1 shared, top-4 selection
Vocab Size: 154,880
Max Context: 202,752 tokens
Dtype: bfloat16
TICA Configuration
TICA Heads: 4 query / 2 KV (GQA)
TICA Head Dim: 128
TICA Total Q Dim: 512
TICA Total KV Dim: 256
Params per TICA: ~3.6M
Total TICA Overhead: ~40M (11 layers × 3.6M)
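The per-layer figure can be sanity-checked from the dimensions above; the split between the projections and the remaining gate/norm parameters is an assumption:

```python
hidden, q_dim, kv_dim = 2048, 512, 256

# Q, K, V, and output projections of one TICA path
proj_params = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden
print(f"{proj_params:,}")  # 3,145,728 -> ~3.1M; the remaining ~0.5M per layer
                           # presumably comes from the LoRA gate and QK norms

print(f"{11 * 3.6e6 / 1e6:.1f}M")  # 39.6M, i.e. the quoted ~40M total
```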
RWKV-7 LoRA Ranks
Decay (w): 512
ICLR / Alpha (a): 256
Gate (g): 384
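RWKV-7's data-dependent decay, in-context learning rate, and gate terms are typically produced by small low-rank bottleneck MLPs, which is what these ranks parameterize. A sketch of that pattern with the ranks listed above; the `tanh` nonlinearity and exact wiring are assumptions:

```python
import torch
import torch.nn as nn

class LowRankHead(nn.Module):
    """hidden -> rank -> hidden bottleneck producing a per-token parameter."""
    def __init__(self, hidden=2048, rank=512):
        super().__init__()
        self.down = nn.Linear(hidden, rank, bias=False)
        self.up = nn.Linear(rank, hidden, bias=False)

    def forward(self, x):
        return self.up(torch.tanh(self.down(x)))

# One head per RWKV-7 parameter, using the ranks from the model card
heads = {
    "decay": LowRankHead(rank=512),
    "a": LowRankHead(rank=256),
    "g": LowRankHead(rank=384),
}
```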
MoE Configuration
Architecture: RWKV07IMoEForCausalLM
Routing Method: noaux_tc
Routed Experts: 64
Shared Experts: 1
Experts per Token: 4
Routing Scale: 1.8
Dense Layers: 1 (first layer)
Next-Token Predict: 1 layer
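A sketch of the top-4 selection with the 1.8 routing scale. The sigmoid scoring follows DeepSeek-style `noaux_tc` routers, but the auxiliary-loss-free score correction is omitted here and the exact form is an assumption:

```python
import torch

def route(router_logits, n_top=4, scale=1.8):
    """Pick top-4 of 64 routed experts; the 1 shared expert is always active."""
    scores = torch.sigmoid(router_logits)             # per-expert affinity
    top_scores, top_idx = scores.topk(n_top, dim=-1)  # top-4 selection
    # normalize the selected weights, then apply the routing scale
    weights = top_scores / top_scores.sum(-1, keepdim=True) * scale
    return top_idx, weights
```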
How It Differs from Conventional Hybrids
| Aspect | Conventional Hybrid | PrimeRWKV (This Model) |
|---|---|---|
| Layer composition | Alternating Attention / Linear layers | All layers are RWKV-7 |
| Attention role | Full independent layer | Tiny auxiliary path inside RWKV block |
| KV cache growth | Full-dimension KV for attention layers | 256-dim KV only in 11/47 layers |
| Identity at init | Attention layers active from start | TICA zero-init, learns contribution |
| Design philosophy | Two architectures coexisting | One architecture with infused capability |
This is not hybridization. This is fusion.
Conventional hybrids treat attention and linear layers as separate entities that take turns. PrimeRWKV with TICA dissolves the boundary: attention capability is infused into the linear attention backbone itself, creating something that is neither purely linear nor a traditional hybrid, but a unified architecture where RWKV-7 and causal attention operate as a single integrated mechanism.
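The KV-cache row of the table can be made concrete. With bfloat16, TICA caches 256-dim K and V in only 11 layers; the comparison baseline below is a hypothetical model caching full-dimension KV (40 heads × 128) in all 47 layers, not any specific hybrid:

```python
BYTES = 2  # bfloat16

tica = 11 * 2 * 256 * BYTES       # 11 TICA layers, K + V, 256 dims each
full = 47 * 2 * 40 * 128 * BYTES  # hypothetical full-dim KV in every layer

print(tica, full, full // tica)   # 11264 962560 85 -> ~85x less cache per token
```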
Base Model & Distillation
- Base Model: GLM-4.7-Flash
- Distillation Method: Based on RADLADS (by SmerkyG), a multi-stage distillation pipeline that converts transformer attention layers into RWKV-7 linear attention while preserving model quality.
- Stage 1: Hidden state MSE alignment (layer-by-layer representation matching)
- Stage 2: KL divergence logit distillation (output distribution matching)
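The two stages correspond to two standard distillation losses; a sketch, with the temperature and KL direction as assumptions:

```python
import torch
import torch.nn.functional as F

def stage1_loss(student_h, teacher_h):
    """Stage 1: layer-by-layer hidden-state MSE alignment."""
    return F.mse_loss(student_h, teacher_h)

def stage2_loss(student_logits, teacher_logits, T=1.0):
    """Stage 2: KL divergence between teacher and student output distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)
```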
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "OpenMOSE/RWKV-GLM-4.7-Flash-exp",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "OpenMOSE/RWKV-GLM-4.7-Flash-exp",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain the concept of linear attention."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Limitations
- Alpha release: not production-ready. Expect rough edges.
- Distillation quality varies across tasks; some capabilities of the original GLM-4.7-Flash may be degraded.
- Long-context performance (>32K) has not been extensively validated.
- Custom code (`trust_remote_code=True`) is required.
Acknowledgments
This work would not have been possible without the generous support and contributions of the following:
featherless.ai: for providing the computing resources that made this research possible. Their support has been invaluable, and we are deeply grateful.
SmerkyG: author of the RADLADS distillation methodology and a key technical advisor throughout this project. His guidance on distillation strategy, training stability, and architectural decisions has been instrumental.
The RWKV Community: for ongoing discussions, feedback, and the shared vision of making efficient architectures practical.
License
This model is released under the Apache 2.0 License, following the base model's licensing terms.