Upload README.md with huggingface_hub
README.md
CHANGED
@@ -14,6 +14,13 @@ base_model: google/gemma-3n-E4B-it
 
 executorch .pte export of google/gemma-3n-E4B-it for on-device mobile inference
 
+## available models
+
+| variant | dtype | size | file |
+|---------|-------|------|------|
+| bf16 | bfloat16 | 13.1 gb | Gemma3n-E4B-IT-text-only.pte |
+| int8 | int8 weights | 9.6 gb | Gemma3n-E4B-text-only-int8.pte |
+
 ## model details
 
 | property | value |
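The table added above lists the two exported .pte files. A minimal sketch of fetching one with huggingface_hub follows; the repo id is a placeholder, since the diff does not name the hosting repository:

```python
from huggingface_hub import hf_hub_download

# placeholder repo id -- replace with the repository that hosts this export
REPO_ID = "your-namespace/gemma-3n-E4B-it-executorch"

# download the int8 variant from the table above (~9.6 gb)
pte_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="Gemma3n-E4B-text-only-int8.pte",
)
print(pte_path)  # local cache path to the downloaded .pte file
```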
@@ -21,9 +28,7 @@ executorch .pte export of google/gemma-3n-E4B-it for on-device mobile inference
 | source model | google/gemma-3n-E4B-it |
 | text parameters | 7.40b |
 | transformer layers | 35 |
-| dtype | float16 |
 | format | executorch .pte |
-| output size | 13.1 gb |
 
 ## text-only export
 
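The model details table pins the format to executorch .pte. The diff does not include the export script itself, but as a rough orientation, the standard executorch.exir lowering flow that produces a .pte looks like the sketch below, shown on a toy module rather than the actual Gemma 3n text decoder:

```python
import torch
from executorch.exir import to_edge

# toy stand-in for the text decoder; the real export wraps the Gemma 3n
# text-only components and traces them with a fixed [1, 32] token input
class TinyDecoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

example_input = (torch.randn(1, 32, 64),)
exported = torch.export.export(TinyDecoder().eval(), example_input)  # torch.export graph
edge = to_edge(exported)            # lower to the edge dialect
et_program = edge.to_executorch()   # lower to the executorch runtime format

with open("toy-decoder.pte", "wb") as f:
    f.write(et_program.buffer)      # serialized .pte file
```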
@@ -39,6 +44,13 @@ this export contains only the text decoder components extracted from the full mu
 
 use this export for text-only inference tasks. if you need multimodal capabilities use the original huggingface model
 
+## quantization
+
+- **bf16**: full bfloat16 precision weights
+- **int8**: int8 weight-only quantization via torchao - recommended for mobile deployment
+
+note: int4 quantization requires gpu for inference and is not suitable for cpu-only mobile deployment
+
 ## export configuration
 
 - fixed sequence length: 32 tokens
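The new quantization section attributes the int8 variant to torchao weight-only quantization. A minimal sketch of what that step might look like is below; the module is an illustrative stand-in (the real model's output head maps to a 262400-entry vocab), and the exact torchao entry point can vary between releases:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# illustrative stand-in for the Gemma 3n text decoder
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.Linear(256, 256),
).to(torch.bfloat16)

# swap Linear weights to int8 in place (weight-only; activations stay bfloat16)
quantize_(model, int8_weight_only())

# the quantized module would then go through the usual torch.export -> .pte lowering
```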
@@ -51,12 +63,12 @@ use this export for text-only inference tasks. if you need multimodal capabiliti
 from executorch.runtime import Runtime
 
 runtime = Runtime.get()
-program = runtime.load_program("
+program = runtime.load_program("Gemma3n-E4B-text-only-int8.pte")
 method = program.load_method("forward")
 
 # input_ids shape: [1, 32] dtype: torch.long
 output = method.execute([input_ids])
-# output shape: [1, 32, 262400] dtype: torch.
+# output shape: [1, 32, 262400] dtype: torch.bfloat16
 ```
 
 ## required patches
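Combining the updated runtime snippet with the fixed 32-token sequence length from the export configuration, a rough end-to-end sketch follows. The tokenizer padding behaviour and the single greedy decode step are assumptions, not something the README specifies:

```python
import torch
from executorch.runtime import Runtime
from transformers import AutoTokenizer

# tokenizer comes from the original source model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3n-E4B-it")

# the export uses a fixed sequence length of 32, so pad/truncate to exactly 32 tokens
enc = tokenizer(
    "Write a haiku about mobile inference.",
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=32,
)
input_ids = enc["input_ids"].to(torch.long)  # shape [1, 32]

runtime = Runtime.get()
program = runtime.load_program("Gemma3n-E4B-text-only-int8.pte")
method = program.load_method("forward")

# logits shape [1, 32, 262400], per the README
logits = method.execute([input_ids])[0]

# naive single-step greedy pick at the last non-padding position
last_pos = int(enc["attention_mask"][0].nonzero().max())
next_token_id = int(logits[0, last_pos].argmax())
print(tokenizer.decode([next_token_id]))
```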