Hugging Face | GitHub | Launch Blog | Documentation
License: Apache 2.0 | Authors: Google DeepMind
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
Gemma 4 introduces key capability and architectural advancements:
- **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
- **Extended Multimodalities** – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).
- **Diverse & Efficient Architectures** – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
- **Optimized for On-Device** – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
- **Increased Context Window** – The small models feature a 128K context window, while the medium models support 256K.
- **Enhanced Coding & Agentic Capabilities** – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
- **Native System Prompt Support** – Gemma 4 introduces native support for the `system` role, enabling more structured and controllable conversations.
# Gemma-4-E4B-it LongRoPE 1M GGUF Q8_0
**Model with extended context window, based on `google/gemma-4-E4B-it` using the LongRoPE method.**
- 🧠 **Base model:** [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)
- 📏 **Original context:** 128K tokens
- 🚀 **Extended context:** 1,048,576 tokens (1M) via **LongRoPE**
- 📦 **Format:** GGUF, quantization **Q8_0**
- ⚙️ **Compatibility:** LM Studio, llama.cpp, and other GGUF‑compatible engines
---
## 🔍 Description
This version of the model was obtained by converting the official instruction-tuned `google/gemma-4-E4B-it` into the universal GGUF format and then extending the context window using the **LongRoPE** technique.
The original context length was 128 thousand tokens; after applying LongRoPE, the model can handle up to **1 million tokens** of continuous dialogue.
Quantization is performed in 8-bit `Q8_0` format, offering a good balance between quality and performance.
> ⚠️ **Important:** Extending the context by interpolating positional embeddings inevitably affects quality. The model has become somewhat “dumber” compared to the original, especially on complex multi-step reasoning tasks. However, with a proper set of parameters and Flash Attention disabled, it delivers satisfactory results on standard tasks.
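The size of the extension can be made concrete with a little arithmetic. The sketch below assumes that "128K" means 128 × 1024 = 131,072 tokens, and it uses plain uniform position interpolation only as an illustration. LongRoPE itself rescales RoPE dimensions non-uniformly, so `interpolated_position` is a hypothetical helper and not the procedure used to build this model.

```python
# Context sizes: the extended value is taken from the model card,
# the original value assumes "128K" means 128 * 1024 tokens.
ORIGINAL_CONTEXT = 128 * 1024
EXTENDED_CONTEXT = 1_048_576

# How many times longer the extended window is than the original one.
scale_factor = EXTENDED_CONTEXT / ORIGINAL_CONTEXT


def interpolated_position(position: int, scale: float) -> float:
    """Uniform position interpolation (simplified illustration, not LongRoPE):
    squeeze a long-context position back into the originally trained range."""
    return position / scale


last_position = EXTENDED_CONTEXT - 1
print("scale factor:", scale_factor)
print("last position maps to:", interpolated_position(last_position, scale_factor))
```

Under these assumptions the window grows by a factor of 8, so neighbouring positions end up much closer together in the rescaled range than they were during training, which is one intuition for the quality drop described above.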
---
## 📊 Performance
Test system:
| Component | Specification |
|-----------|---------------|
| CPU | 2× Intel Xeon E5-2695 v4 @ 2.10GHz (AVX, AVX2) |
| RAM | 512 GB |
| GPU | NVIDIA GeForce RTX 3060 12 GB (CUDA 12.9, Compute Capability 8.6) |
**Inference speed:**
- **LM Studio 0.4.12 (Build 1)**: stable **~21 tokens/s**
- **llama.cpp (server, no CPU offload)**:
  - Start: **34 tokens/s**
  - End of context fill: drops to **18 tokens/s**
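To put these speeds in perspective, the following sketch converts them into wall-clock time for a long answer. The 8192-token response length is an illustrative assumption, and the estimate ignores prompt processing time, which was not reported.

```python
# Generation speeds reported above, in tokens per second.
SPEEDS_TOK_PER_S = {
    "LM Studio": 21,
    "llama.cpp, start of context": 34,
    "llama.cpp, end of context fill": 18,
}


def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a constant speed (prompt processing excluded)."""
    return n_tokens / tokens_per_second


# 8192 tokens is an assumed response length, not a benchmark setting.
RESPONSE_TOKENS = 8192

for name, speed in SPEEDS_TOK_PER_S.items():
    minutes = generation_seconds(RESPONSE_TOKENS, speed) / 60
    print(f"{name}: {minutes:.1f} min")
```

With these numbers, a response of that length takes roughly 4 minutes at the start of the context and about 7.6 minutes once the context is nearly full.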
---
## 🧩 Recommended settings
### For LM Studio
Create a preset named, e.g., “BEST”, and set the following parameters:
```json
{
  "identifier": "@local:best",
  "name": "BEST",
  "changed": true,
  "operation": {
    "fields": [
      { "key": "llm.prediction.temperature", "value": 1.3 },
      { "key": "llm.prediction.contextOverflowPolicy", "value": "rollingWindow" },
      { "key": "llm.prediction.llama.cpuThreads", "value": 32 },
      { "key": "llm.prediction.topKSampling", "value": 500 },
      { "key": "llm.prediction.repeatPenalty", "value": { "checked": true, "value": 1 } },
      { "key": "llm.prediction.llama.presencePenalty", "value": { "checked": true, "value": 0 } },
      { "key": "llm.prediction.topPSampling", "value": { "checked": true, "value": 0.99 } },
      { "key": "llm.prediction.minPSampling", "value": { "checked": true, "value": 0.05 } }
    ]
  },
  "load": {
    "fields": []
  }
}
```
- Temperature is recommended in the range 1.0 – 1.3.
- Keep Flash Attention disabled; with it, the model degrades more.
### For llama.cpp (server)
Example launch, under which the model reliably solves logical “Einstein puzzles”:
```bash
"E:\LLM\llama.cpp\build\bin\llama-server.exe" \
  -m "C:/LLM/Nikitayev/google_gemma-4-E4B-it/google_gemma-4-E4B-it-q8_0.gguf" \
  --mmproj "C:/LLM/lmstudio-community/gemma-4-E4B-it-GGUF/mmproj-gemma-4-E4B-it-BF16.gguf" \
  --host 127.0.0.1 --port 8080 \
  --timeout 60000 --threads-http -1 \
  --ctx-size 1048576 \
  --flash-attn on --fit off --kv-offload \
  --mmap --cont-batching --webui --jinja --embedding --metrics --slots --cache-prompt --mlock \
  --reasoning-format auto \
  --temp 0.75 --dynatemp-range 0.75 \
  --top-k 10000 --top-p 0.99 --min-p 0.05 \
  --xtc-probability 0 \
  --repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 \
  --dry-multiplier 0.0 \
  --samplers "penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature" \
  --n-predict 8192 --seed 0
```
> **Note:** The `--flash-attn on` flag is left here because in some llama.cpp scenarios the combination of flash attention + sampling parameters works better than in LM Studio. Try `--flash-attn off` if you experience instability.
---
## 🧪 Observations and Conclusions
- Extending the context with LongRoPE leads to a noticeable but not critical drop in the model’s “intelligence”.
- With Flash Attention disabled, the quality of answers on standard conversational tasks remains acceptable.
- On complex logical tasks, the model remains capable with carefully chosen sampling parameters (see examples above).
- Using GPU offloading is critical; CPU-only inference drops speed dramatically. The RTX 3060 with 12 GB allows loading all model weights into VRAM.
---
## 📁 Files
- `google_gemma-4-E4B-it-q8_0.gguf` – main GGUF Q8_0 model with extended context.
- `mmproj-gemma-4-E4B-it-BF16.gguf` – multimodal embedding projector (original, BF16), required for the full pipeline.
---
## 👤 Author
**Nikitayev**
📧 nikitayev@mail.ru
📧 nikitayev1979@gmail.com
---
This model was created for research purposes. Distributed under the terms of the original Google Gemma model license.
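Once a llama.cpp server like the one in the recommended settings is listening on `127.0.0.1:8080`, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal sketch using only the Python standard library. The helper names `build_chat_request` and `ask`, the `model` label, and the `max_tokens` and `timeout` defaults are illustrative assumptions, not part of the model or of llama.cpp.

```python
import json
import urllib.request

# Host and port match the --host/--port values of the llama-server launch command.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        # Label only (assumption): a server started with a single -m model
        # is expected to answer with that model regardless of this value.
        "model": "gemma-4-E4B-it-1M",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, max_tokens: int = 256, timeout: float = 600.0) -> str:
    """Send the prompt to the running server and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(ask("Solve this logic puzzle step by step: ..."))
```

Sampling parameters are left out of the request on purpose, so the values passed on the server command line apply. A generous timeout is used because generation slows down as the context fills.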