Instructions for using google/gemma-4-E2B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-4-E2B-it with Transformers:

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-E2B-it")
```

- Notebooks
- Google Colab
- Kaggle
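The loading snippet above can be extended into a full generation round. This is a hedged sketch, not the checkpoint's documented usage: the chat message format follows the current transformers convention for image-text-to-text models, and the exact template this checkpoint expects may differ. Running `generate` requires network access to download the weights.

```python
def build_messages(prompt, image_url=None):
    """Build a chat-style message list for processor.apply_chat_template.

    Uses the transformers image-text-to-text content convention:
    a list of {"type": ...} parts per message.
    """
    content = []
    if image_url is not None:
        content.append({"type": "image", "url": image_url})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


def generate(prompt, image_url=None, max_new_tokens=64):
    """Run one generation round (downloads the model on first call)."""
    from transformers import AutoProcessor, AutoModelForImageTextToText

    processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
    model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-E2B-it")

    inputs = processor.apply_chat_template(
        build_messages(prompt, image_url),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```

`build_messages("Describe this image.", "https://example.com/cat.jpg")` produces the nested role/content structure that `apply_chat_template` consumes.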
Issue with llama.cpp
% llama-server -hf ggml-org/gemma-4-E2B-it-GGUF
load_backend: loaded BLAS backend from /opt/homebrew/Cellar/ggml/0.9.11/libexec/libggml-blas.so
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.011 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 38654.71 MB
load_backend: loaded MTL backend from /opt/homebrew/Cellar/ggml/0.9.11/libexec/libggml-metal.so
load_backend: loaded CPU backend from /opt/homebrew/Cellar/ggml/0.9.11/libexec/libggml-cpu-apple_m4.so
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
system info: n_threads = 10, n_threads_batch = 10, total_threads = 14
system_info: n_threads = 10 (n_threads_batch = 10) / 14 | MTL : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | SME = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 13 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/Users/michael/.cache/huggingface/hub/models--ggml-org--gemma-4-E2B-it-GGUF/snapshots/4b90c7b785141802608550fc3cd3c715201532e2/gemma-4-e2b-it-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.18 seconds
llama_model_load_from_file_impl: using device MTL0 (Apple M4 Pro) (unknown id) - 36863 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 601 tensors from /Users/michael/.cache/huggingface/hub/models--ggml-org--gemma-4-E2B-it-GGUF/snapshots/4b90c7b785141802608550fc3cd3c715201532e2/gemma-4-e2b-it-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 64
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.size_label str = 4.6B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://ai.google.dev/gemma/docs/gemm...
llama_model_loader: - kv 8: general.tags arr[str,1] = ["any-to-any"]
llama_model_loader: - kv 9: gemma4.block_count u32 = 35
llama_model_loader: - kv 10: gemma4.context_length u32 = 131072
llama_model_loader: - kv 11: gemma4.embedding_length u32 = 1536
llama_model_loader: - kv 12: gemma4.feed_forward_length arr[i32,35] = [6144, 6144, 6144, 6144, 6144, 6144, ...
llama_model_loader: - kv 13: gemma4.attention.head_count u32 = 8
llama_model_loader: - kv 14: gemma4.attention.head_count_kv u32 = 1
llama_model_loader: - kv 15: gemma4.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 16: gemma4.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 17: gemma4.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: gemma4.attention.key_length u32 = 512
llama_model_loader: - kv 19: gemma4.attention.value_length u32 = 512
llama_model_loader: - kv 20: gemma4.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 21: gemma4.attention.sliding_window u32 = 512
llama_model_loader: - kv 22: gemma4.attention.shared_kv_layers u32 = 20
llama_model_loader: - kv 23: gemma4.embedding_length_per_layer_input u32 = 256
llama_model_loader: - kv 24: gemma4.attention.sliding_window_pattern arr[bool,35] = [true, true, true, true, false, true,...
llama_model_loader: - kv 25: gemma4.attention.key_length_swa u32 = 256
llama_model_loader: - kv 26: gemma4.attention.value_length_swa u32 = 256
llama_model_loader: - kv 27: gemma4.rope.dimension_count u32 = 512
llama_model_loader: - kv 28: gemma4.rope.dimension_count_swa u32 = 256
llama_model_loader: - kv 29: tokenizer.ggml.model str = gemma4
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,262144] = ["", "", "", "", ...
llama_model_loader: - kv 31: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,514906] = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 36: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 38: tokenizer.ggml.mask_token_id u32 = 4
llama_model_loader: - kv 39: tokenizer.chat_template str = {%- macro format_parameters(propertie...
llama_model_loader: - kv 40: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - type f32: 283 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q8_0: 317 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 4.61 GiB (8.52 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/Users/michael/.cache/huggingface/hub/models--ggml-org--gemma-4-E2B-it-GGUF/snapshots/4b90c7b785141802608550fc3cd3c715201532e2/gemma-4-e2b-it-Q8_0.gguf'
srv load_model: failed to load model, '/Users/michael/.cache/huggingface/hub/models--ggml-org--gemma-4-E2B-it-GGUF/snapshots/4b90c7b785141802608550fc3cd3c715201532e2/gemma-4-e2b-it-Q8_0.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
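The error above comes from llama.cpp not recognizing the `general.architecture` value (`gemma4`) stored in the GGUF metadata. One quick way to confirm what architecture string a GGUF file declares, without involving llama.cpp at all, is to read the header directly. This is a minimal sketch of a GGUF header reader; it assumes `general.architecture` is the first metadata key, which is the usual convention in converter-produced files.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # GGUF metadata value-type code for strings


def read_gguf_architecture(path):
    """Return the `general.architecture` string from a GGUF file.

    Header layout: 4-byte magic, uint32 version, uint64 tensor count,
    uint64 metadata kv count, then key-value pairs. Strings are
    length-prefixed (uint64) UTF-8 with no terminator.
    """
    with open(path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError("not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
        # First key-value pair: length-prefixed key, then value type.
        (key_len,) = struct.unpack("<Q", f.read(8))
        key = f.read(key_len).decode("utf-8")
        (value_type,) = struct.unpack("<I", f.read(4))
        if key != "general.architecture" or value_type != GGUF_TYPE_STRING:
            raise ValueError(f"unexpected first metadata key: {key}")
        (val_len,) = struct.unpack("<Q", f.read(8))
        return f.read(val_len).decode("utf-8")
```

If this prints an architecture string that your installed llama.cpp build predates, the fix is updating llama.cpp rather than re-downloading the GGUF.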
Considering the "llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'" error message, it looks like you didn't update llama.cpp.
I just checked the llama.cpp GitHub repo, and while support for Gemma 4 is in development, it has not been publicly released yet.
Thanks for checking. I tried updating since that was my first thought.
It works now after updating again.