Qwen3-4B-hybrid-OC

A hybrid model developed by Openchip by applying its proprietary linearization approach, LayerBoost, to Qwen3-4B, which reduces latency and increases token throughput by 68%.

Details about the LayerBoost approach are available on arXiv: https://arxiv.org/pdf/2604.22050v2

Setup

Before using this model, install the following system dependencies:

apt-get update
apt-get install -y --no-install-recommends \
    build-essential

After that, install the Python requirements:

transformers==4.56.2
accelerate==1.11.0
peft==0.18.0
torch==2.10
flash-attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.12/flash_attn-2.8.3+cu128torch2.10-cp312-cp312-linux_x86_64.whl
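
Because the pins above are exact, it can help to confirm that the active environment matches them before loading the model. The following is a minimal sketch, not part of the original instructions: the helper name `check_pins` and the prefix-matching rule (so that a pin of `2.10` accepts `2.10.0` or a build such as `2.10.0+cu128`) are assumptions made for illustration. It uses only the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above (flash-attn is omitted
# because it is installed from a wheel URL rather than by version).
PINNED = {
    "transformers": "4.56.2",
    "accelerate": "1.11.0",
    "peft": "0.18.0",
    "torch": "2.10",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for each missing or mismatched package.

    `found` is None when the package is not installed.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems[name] = (expected, None)
            continue
        # Assumed matching rule: exact match, or the pin followed by a
        # further ".patch" or "+local" suffix.
        matches = (
            found == expected
            or found.startswith(expected + ".")
            or found.startswith(expected + "+")
        )
        if not matches:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package was found with a matching version; anything printed points at a package to reinstall.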

Example Usage

Log in to Hugging Face

hf auth login

Use the model

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch

model_id = "openchip-sw/Qwen3-4B-LayerBoost"

device = "cuda" if torch.cuda.is_available() else "cpu"

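# The repository ships custom model code, so the config load is also given trust_remote_code.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)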
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model.eval()
model.to(device)

prompt = "Give me the list of European countries accompanied by their capitals"

eval_inputs = tokenizer(prompt, return_tensors="pt")
eval_inputs = {k: v.to(device) for k, v in eval_inputs.items()}
eval_tokens = model.generate(
    **eval_inputs,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
    return_dict_in_generate=True,
)

out_text = tokenizer.decode(eval_tokens.sequences[0], skip_special_tokens=True)
print("Generated text:", out_text)
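
To get a rough sense of token throughput on your own hardware, the generation call can be timed. The sketch below is an illustration only and does not reproduce the measurement setup behind the 68% figure, which is not described here. The helper `tokens_per_second` and its parameters are hypothetical names introduced for this example; it accepts any zero-argument callable that returns the total sequence length (prompt plus generated tokens), so it can be tried without loading the model.

```python
import time


def tokens_per_second(generate_fn, prompt_len, n_runs=3, timer=time.perf_counter):
    """Average new-token throughput of `generate_fn` over `n_runs` calls.

    generate_fn: zero-argument callable returning the total sequence length
                 (prompt tokens + newly generated tokens) of one generation.
    prompt_len:  number of prompt tokens, subtracted so only new tokens count.
    """
    total_tokens = 0
    total_time = 0.0
    for _ in range(n_runs):
        start = timer()
        seq_len = generate_fn()
        total_time += timer() - start
        total_tokens += seq_len - prompt_len
    return total_tokens / total_time


# Possible use with the objects defined in the example above (left commented
# out so that this block runs without the model):
#
# def run():
#     out = model.generate(**eval_inputs, max_new_tokens=512, do_sample=False)
#     return out.shape[1]
#
# print(tokens_per_second(run, eval_inputs["input_ids"].shape[1]))
```

For steadier numbers, consider running one untimed warm-up generation first, and on GPU calling `torch.cuda.synchronize()` before each timer read.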