gemma-3n-E4B-it-pte

executorch .pte export of google/gemma-3n-E4B-it for on-device mobile inference

available models

variant	dtype	size	file
bf16	bfloat16	13.1 GB	Gemma3n-E4B-IT-text-only.pte
int8	int8 weights	9.6 GB	Gemma3n-E4B-text-only-int8.pte

model details

property	value
source model	google/gemma-3n-E4B-it
text parameters	7.40b
transformer layers	35
format	executorch .pte

text-only export

this export contains only the text decoder components extracted from the full multimodal gemma-3n model

included:

language_model (transformer decoder)
lm_head (output projection)

not included:

vision_tower (image encoder)
audio_tower (audio encoder)

use this export for text-only inference tasks. if you need multimodal capabilities, use the original huggingface model (google/gemma-3n-E4B-it)

quantization

bf16: full bfloat16 precision weights
int8: int8 weight-only quantization via torchao - recommended for mobile deployment

note: int4 quantization requires gpu for inference and is not suitable for cpu-only mobile deployment
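
to make the int8 entry above concrete, here is a minimal pure-python sketch of what weight-only int8 quantization does to one row of weights. it is not torchao's implementation: the symmetric per-row scheme and the helper names quantize_row and dequantize_row are assumptions chosen for illustration. the idea is that each row is stored as int8 values plus one floating-point scale, and weights are rescaled at inference time while activations stay in floating point.

```python
from typing import List, Tuple


def quantize_row(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization of one weight row (illustrative only).

    scale = max|w| / 127, q = round(w / scale), clamped to [-127, 127].
    """
    max_abs = max(abs(w) for w in row)
    # An all-zero row has no meaningful scale; use 1.0 to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from int8 values and a scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]
    q, scale = quantize_row(weights)
    print(q, scale)
    print(dequantize_row(q, scale))
```

each weight is recovered to within half a quantization step (scale / 2), which is the rounding error this kind of scheme trades for storing one byte per weight.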

export configuration

fixed sequence length: 32 tokens
torch.export with strict=False
executorch to_edge conversion

usage

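import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("Gemma3n-E4B-text-only-int8.pte")
method = program.load_method("forward")

# input_ids shape: [1, 32] dtype: torch.long
# placeholder ids; in practice use tokenizer output padded or truncated to 32 tokens
input_ids = torch.zeros((1, 32), dtype=torch.long)

# execute() returns a sequence of outputs; the logits are the first element
outputs = method.execute([input_ids])
logits = outputs[0]
# logits shape: [1, 32, 262400] dtype: torch.bfloat16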
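
the call above runs a single forward pass over a fixed 32-token window, so generating text means calling the model once per new token. the sketch below shows one way to do that with greedy decoding, in plain python so the control flow is easy to follow. several things here are assumptions, not part of this export: forward_fn is a hypothetical wrapper around method.execute that takes 32 token ids and returns one row of vocabulary scores per position, PAD_ID = 0 is a guess to check against the tokenizer config, and right-padding is only valid if the exported graph applies causal attention, so that padding after the last real token does not change its logits.

```python
from typing import Callable, List, Optional, Sequence

SEQ_LEN = 32  # fixed sequence length from the export configuration
PAD_ID = 0    # assumption: pad token id; verify against the tokenizer config

# One row of vocabulary scores per input position.
Logits = Sequence[Sequence[float]]


def greedy_generate(
    prompt_ids: Sequence[int],
    forward_fn: Callable[[List[int]], Logits],  # hypothetical model wrapper
    max_new_tokens: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Greedy decoding over a fixed window, re-running the model each step."""
    if not prompt_ids:
        raise ValueError("prompt_ids must contain at least one token")
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-SEQ_LEN:]   # keep only the most recent tokens
        last = len(window) - 1    # position of the last real token
        padded = window + [PAD_ID] * (SEQ_LEN - len(window))
        logits = forward_fn(padded)
        row = logits[last]
        # argmax over the vocabulary at the last real position
        next_id = max(range(len(row)), key=row.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids
```

with the real runtime, forward_fn would build a [1, 32] torch.long tensor from the padded list, call method.execute, and return the first batch element of the logits. using the tensor's own argmax there would be much faster than the pure-python scan over 262400 entries shown here. once the context exceeds 32 tokens, older tokens fall out of the window and no longer influence the output.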

required patches

the transformers library requires two patches before export. see maceip/gemma3n-executorch for details

benchmarks

coming soon
