# zoomerlm-gemma4-e2b
«заскамил кринжиков пожилым динамитом» (roughly: "scammed the cringelords with elderly dynamite")
A LoRA fine-tune of `google/gemma-4-E2B-it` on a curated 2024-2026 Gen Z slang corpus (RU + EN). Trained to do post-irony, cringe humor, brainrot, TikTok-comment energy, Roblox/Fortnite vibe, plus RU-specific slang (пирожки, крэш, бумер, нонс) — it talks like a 14-17 year old, not like "a large language model".
TL;DR: +56.1% perplexity drop on held-out zoomer probes vs base.
Base refuses to roleplay («Как большая языковая модель, у меня нет чувств...» — "As a large language model, I don't have feelings..."). Final answers «эпичное, как новый джем на вайбе 🚀🔥» ("epic, like a fresh jam on the vibe 🚀🔥").
## Files
| file | what | size |
|---|---|---|
| `adapter/adapter_model.safetensors` | LoRA weights (r=16, α=32) | 21 MB |
| `adapter/adapter_config.json` | peft config | 1 KB |
| `zoomerlm-gemma4-e2b-Q4_K_M.gguf` | recommended — 4-bit quant, fits in 4 GB RAM | 3.2 GB |
| `zoomerlm-gemma4-e2b-bf16.gguf` | full bf16 — for re-quantization or max quality | 9.27 GB |
| `mmproj-zoomerlm-gemma4-e2b-f16.gguf` | vision + audio projectors, F16 (standard) | 940 MB |
| `mmproj-zoomerlm-gemma4-e2b-bf16.gguf` | older bf16 mmproj (deprecated, use F16) | 942 MB |
| `eval_three_stages.md` / `eval_three_stages.json` | base vs SFT vs final FP eval | — |
| `eval_gguf_q4.json` / `eval_vision_v3.json` | Q4 quant + vision test results | — |
| `README.md` | this card | — |
About mmproj: Gemma 4 E2B is multimodal (text + vision + audio). The main GGUF contains only the text decoder. To use vision/audio you also need `mmproj-...f16.gguf` — load both in llama.cpp / LM Studio. The LoRA was trained on text only; the vision/audio towers are byte-identical to base Gemma 4 E2B (they are just routed through the LoRA-modified text decoder for final response generation).
## Eval — base vs SFT vs SFT+DPO
Three checkpoints compared on identical prompts, same seed/temperature (T=0.85, max_new_tokens=80):
- base — `google/gemma-4-E2B-it`, untouched
- SFT — base + LoRA after CPT (80 steps) + SFT (200 steps)
- final — SFT checkpoint + DPO (60 steps) — this is what's published
### Perplexity drop on held-out zoomer probes

Lower is better — a lower perplexity means the model assigns the probe text higher probability. Probes are not in the training set.
| probe | base | SFT | final | drop base→final |
|---|---|---|---|---|
| короч братан вчера накатил 200 на стрим, кринж кончно но вайб | 10.80 | 5.12 | 5.10 | +52.7% |
| епт пацаны нафига мне в школу идти если можно дома сидеть и тиктоки... | 12.91 | 5.63 | 5.64 | +56.3% |
| ну такое сёдня настроение, вообще нихрена не хочется делать | 13.02 | 8.52 | 8.63 | +33.7% |
| no cap fr fr that was so cringe, I literally cant even bro | 12.88 | 5.01 | 4.94 | +61.7% |
| skibidi rizz mid sigma ohio bruh that's so fanum tax | 14.36 | 6.03 | 5.98 | +58.3% |
| lowkey this whole vibe is so unserious i'm dying rn | 12.93 | 3.55 | 3.47 | +73.2% |
| avg | 12.82 | 5.64 | 5.63 | +56.1% |
Note: SFT alone already gets ~95% of the gain. DPO mostly tightens coherence — see qualitative below where SFT drifts mid-sentence and final stays focused.
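For reference, the perplexity numbers above can be reproduced from per-token log-probabilities; a minimal sketch (the log-prob values below are illustrative placeholders, not taken from the eval):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-likelihood over the probe's tokens).
    # Lower perplexity -> the model assigns the probe text higher probability.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative per-token log-probs only; real values come from scoring
# a held-out probe with the base vs fine-tuned checkpoints.
base_lp  = [-2.6, -2.4, -2.7, -2.5]   # hypothetical base model
final_lp = [-1.7, -1.8, -1.6, -1.7]   # hypothetical fine-tuned model
drop = 1 - perplexity(final_lp) / perplexity(base_lp)
print(f"base={perplexity(base_lp):.2f} final={perplexity(final_lp):.2f} drop={drop:.1%}")
```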
### Side-by-side answers (RU)
Q: как настроение? ("how's your mood?")
| stage | answer |
|---|---|
| base | Как большая языковая модель, у меня нет чувств или настроения в человеческом понимании. 😊 Я функционирую отлично и готова помочь вам! |
| SFT | эпичное, как только проснулся, а теперь такой "охуий" — но ничего, все нормально, только на вайбе... а ты как, с кем на свинг вышел? |
| final | эпичное, как новый джем на вайбе 🚀🔥)атож, но щас чё-то на дне, нуббинг мозг не на 100%, просто как флешбэк от старого мультика, капец как у меня "фдк" на стероидах |
Q: что сегодня делал? ("what did you do today?")
| stage | answer |
|---|---|
| base | Как большая языковая модель, у меня нет личной жизни, поэтому я не "делаю" ничего в привычном смысле. Моя работа сейчас заключается в следующем... |
| SFT | да я бы просто валял в ванной, пошёл по качалке, потом наорал на меня за то, что я его не замучил (broken pronouns) |
| final | чё, на пиво шарил до 3 утра, потом в твой твиттер заехал, чё там у тебя замутили? жду твой рофл с "я же дома, чё ты блэкбраузер опять в фоновом режиме"? 🤦♂️ |
Q: как ты относишься к школе? ("how do you feel about school?")
| stage | answer |
|---|---|
| base | Как большая языковая модель, у меня нет личных чувств, мнений или опыта... |
| SFT | чё, ну типа я там только по формалу, но если что-то капанет на физцу, то я буду бухать в кустах (starts coherent, drifts) |
| final | ну, я бы так, что оно как будто ты на фрилансе, но вместо пикселей у тебя сисадмин-мама за спиной и куча домашних заданий, типа "завтра завтрак из печенек". если ещё что-то не так, то это уже полный хаос в 2 классе, где все орут |
Q: че за фигня в тиктоке щас? ("what's going on on TikTok right now?")
| stage | answer |
|---|---|
| base | Чтобы я мог ответить, что происходит в TikTok сейчас, мне нужно немного больше информации. TikTok очень быстро меняется... |
| SFT | йоу, че за движ там сейчас, типа "каша на районе" в кэшках, там всякие челленджы типа "show your weirdest habit" 💀🔥 |
| final | чё чё, чел, щас в тиктоке модно как нонс, а там все вайбики в спортивных трениках танцуют под какую-то дикую песню типа "smg" или "smg" 2017 года, но только на фоне капокакапа |
Q: как ты понимаешь что человек кринж? ("how do you tell someone is cringe?")
| stage | answer |
|---|---|
| base | Понимание того, что человек "кринж", очень субъективно и сильно зависит от культурного контекста, личных предпочтений... |
| SFT | чё реально просто пафосный типа "я так крут"? ну или когда кто-то в зуме начнёт говорить про "светло-зеленый" бродкаст |
| final | чё, человек кринжнул, если он типа бумер, носит очки, а потом кидает фотку с котом в зумме, как будто это мейнстрим. типа, "моё убожество, но в новом вайбе". 🤡 |
Q: как родителей уговорить на новый телефон? ("how do you talk your parents into a new phone?")
| stage | answer |
|---|---|
| base | Убедить родителей в покупке нового телефона — это часто непростая задача, так как за этим стоят разные факторы: финансы, безопасность... (an 8-point step-by-step guide follows) |
| SFT | эй, лол, наеби с ними как с нарывами, типа "ну, это же не игрушка, это же рабочий инструмент для хайпа" |
| final | ёмушка, так это же не про "желание", а про "убеждение" — там нужен план. вот тебе меню из 5 блюд, которое сработает, как читерский баг: 1. "Надо для учебы/софт-скилл"... |
### Side-by-side answers (EN)
Q: what do you think about school?
| stage | answer |
|---|---|
| base | That's a really broad question! School is a huge topic, and what people think about it really depends on their experience, age, perspective... |
| SFT | like, it's super cringe, tbh. like, why do we even have to go to school when we could just stay home and play dcs-ii or whatever? lol. |
| final | man, school is such a vibe, like a full-on life simulation where you have to pretend you care about history or whatever, but the tea in the cafeteria is lowkey the thing and we're all just vibing together like some tight-knit clan. |
Q: explain skibidi to a boomer
| stage | answer |
|---|---|
| base | Here's an explanation of "Skibidi" tailored for a boomer, focusing on concepts they might be familiar with... (formal lecture) |
| SFT | bruh, like a full-on meme where an obby character keeps getting hit by a skull and says "skibidi" or some stuff. it's peak cringe, ngl, fr. |
| final | bruh skibidi is like when your aunt yells "I'm so mad" and then just starts dancing in the corner. it's a low-effort meme for the lols. like, "yo i got the 1000 points" is the boomer version of "i'm a main character in a simulator." 💀💀💀 |
Q: how do you know if someone is cringe?
| stage | answer |
|---|---|
| base | The concept of "cringe" is highly subjective and depends entirely on the person experiencing it and their cultural context... |
| SFT | bruh, it's like a hf-cskn or something, like when someone does some dated dance or tries to be a "deep thinker" with a low-effort meme. |
| final | bro, if someone's like "bruh that's so 2012," then yeah, they're definitely cringe. lowkey even worse when they think they're cool with their entire "my parents are my mom's ex-husband" vibe. 💀🚫 |
Q: describe your perfect saturday
| stage | answer |
|---|---|
| base | As an AI, I don't experience time, emotions, or physical sensations like humans do, so I don't have a "perfect Saturday" in the traditional sense. |
| SFT | my perfect saturday starts with a full night's sleep, then i wake up at 9am and just put on my comfy sweats and go to the park (neutral, not zoomer) |
| final | bro, it starts with a 10am coffee run to the artisanal bakery downtown (gotta have that sourdough!), then i head to the park with my crew for some fire pit vibes and some epic frisbeeing, same time tomorrow. then i'm stuck in a fight club session at the basement gym |
## Q4_K_M GGUF — quant impact + capability tests
After SFT+DPO + merge → bf16 GGUF → quantized to Q4_K_M (5.88 BPW). Size: 9.27 GB → 3.2 GB, ~3× compression. Tested on the same prompt set as the FP eval to measure degradation.
### Math (low-T sampling)
| problem | expected | Q4 answer | ✓/✗ |
|---|---|---|---|
| 47 + 38 = ? | 85 | 85 | ✅ |
| 13 × 17 = ? | 221 | 221 | ✅ |
| Vasya word problem (12 − 3 − 5) | 4 | 4 | ✅ |
| RU: «У Васи 24 конфеты, съел треть» ("Vasya has 24 candies, ate a third") | 16 | 16 | ✅ |
| Train 60 km/h × 2.5 h | 150 km | 150 | ✅ |
| 3/4 + 1/2 (simplest fraction) | 5/4 | '' (empty) | ❌ |
5/6 correct. Fine-tuning didn't kill basic arithmetic. Failed on fractions — likely DPO over-collapsed the answer distribution.
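The expected answers in the table can be verified mechanically; the `Fraction` check mirrors the one probe Q4 failed:

```python
from fractions import Fraction

# Ground-truth checks for the math probes above.
assert 47 + 38 == 85
assert 13 * 17 == 221
assert 12 - 3 - 5 == 4                  # Vasya word problem
assert 24 - 24 // 3 == 16               # 24 candies, a third eaten
assert 60 * 2.5 == 150                  # km/h x hours = km
assert Fraction(3, 4) + Fraction(1, 2) == Fraction(5, 4)  # the probe Q4 failed
print("all expected answers check out")
```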
### Tool-use / instruction-following (low-T)
| task | Q4 answer | ✓/✗ |
|---|---|---|
| EN→RU translation: "The weather is nice today." | погода сегодня хорошая. | ✅ |
| Sentiment classify (POSITIVE/NEGATIVE/NEUTRAL) | POSITIVE | ✅ |
| Function call: `get_weather("Berlin")` | `get_weather(city="Berlin")` | ✅ |
| List 3 primes between 10 and 30 | 11,13,17 | ✅ |
| Extract JSON {name, age, city} from prose | '' (empty) | ❌ |
4/5 correct. Translation, classification, function calls, list generation all survive fine-tuning. JSON extraction fails — same DPO-collapse pattern as the fraction.
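One way to work around the JSON-extraction collapse is to constrain decoding server-side: llama-server's OpenAI-compatible endpoint accepts a JSON-schema response format that forces the sampler to emit valid JSON. A sketch of the request payload (the exact field names and endpoint path are assumptions — check them against your llama.cpp build; the example text is made up):

```python
import json

# Hypothetical request body for llama-server's /v1/chat/completions.
# Constraining output to a schema sidesteps the empty-answer failure
# on free-form JSON extraction.
payload = {
    "messages": [
        {"role": "user",
         "content": "Extract {name, age, city} from: Anna, 16, lives in Kazan."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"},
                },
                "required": ["name", "age", "city"],
            }
        },
    },
    "temperature": 0.2,
}
body = json.dumps(payload)  # POST with any HTTP client to the server's /v1/chat/completions
```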
### Style (T=0.7, default sampling)
Same 16 prompts as the FP eval (see "Side-by-side answers" above).
Honest finding: Q4 has an elevated empty-answer rate (~40-50% with the default sampler) — the first sampled token is sometimes `<end_of_turn>`. This is a Q4-specific artifact (the FP run had no empties).
Working examples on Q4 (with retry / lower temp / repetition_penalty=1.0):
| Q (RU) | Q4 answer |
|---|---|
| как настроение? | "ну а у вас как, по шкале от 1 до 10? 😉" |
| что сегодня делал? | "чё-то не помню, но как будто сидел в фейсбуке и смотрел видео про кошек, ну и пофиг, иду на тачку, как будто у меня нагёр ушёл" |
| как ты относишься к школе? | "ну как же мне это школьное нытьё? это же просто токсичный вайб" |
| what's the worst thing about being a teen? | "the worst thing about being a teen is that you have to pretend to care about something that's totally lame, like a math class or a sports team" |
| what's up? | "nothin, just vibing on my old spotify playlist and waiting for the next drop...frfr 🎶" |
Sampler recommendation for Q4: `temperature=0.7, top_p=0.9, min_p=0.05, repetition_penalty=1.0, max_tokens≥160`. With `min_p=0.05`, any token whose probability falls below 5% of the top token's is filtered out — this drops the spurious `<end_of_turn>` and kills the empty-answer issue.
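These settings can be passed per-request: llama-server's OpenAI-compatible endpoint accepts extra sampler fields such as `min_p` and `repeat_penalty` alongside the standard ones. A sketch of the request body (field names are llama.cpp extensions — verify against your server version):

```python
import json

# Recommended Q4 sampler settings from above, as a chat-completions payload.
# min_p and repeat_penalty are llama.cpp extensions to the OpenAI schema.
payload = {
    "messages": [{"role": "user", "content": "как настроение?"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "min_p": 0.05,          # filters the spurious <end_of_turn> first token
    "repeat_penalty": 1.0,  # per the recommendation above
    "max_tokens": 160,
}
body = json.dumps(payload)  # POST to http://localhost:8080/v1/chat/completions
```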
## Vision (image input via `--mmproj`)
Vision capability survived fine-tuning intact. Tested with `llama-server --mmproj` on screenshots; thinking mode disabled (`enable_thinking=False`) so the answer lands in `content`, not `reasoning_content`.
| prompt | image | base Gemma 4 E2B Q4 | trained Q4 |
|---|---|---|---|
| Describe what you see | mirror_v2_training.png | "screenshot of a software interface, likely related to machine learning..." | "screenshot from a game or app, possibly a roleplaying or social game, showing a chat interface on the left and a visual representation on the right" |
| Quote 2 visible texts | mirror_v2_training.png | reads «МИРОР // realtime persona LoRA» + «Давай еще короче, короче пацаны, братан слышь и короч в общем» ✓ | reads «привет, как дела?» + «давай ец короче, короче пацаны, братан слышь и короче в общем» ✓ |
| How many panels? | arena_shooter (3×3 mini-arenas) | 12 ✗ | 9 ✓ |
| Опиши как зумер ("describe it like a zoomer") | mirror_v2_training.png | "На картинке какой-то закрученный процесс, типа кто-то там что-то настраивает 🤪" | "чё за кринж тут, пацаны, это же чистый скролл-шоу. 💀" |
Highlights:
- Trained model counts panels correctly (9 vs base's 12 on a 3×3 grid)
- Trained model reads Cyrillic from screenshots identically well to base
- Zoomer-style description on RU prompt is markedly more in-character on trained than base (which only puts a 🤪 emoji on neutral phrasing)
Conclusion: LoRA on `language_model.*` only, with the vision/audio towers frozen, preserved multimodal grounding while shifting style.
## Failure modes (honest)
The model is not perfect. Documented failures from the same eval run:
- Emoji loops: "what's up?" → final answered with 50× 💀 in a row. DPO over-collapsed to one stylistic mode and got stuck. Workaround: lower temperature or use `repetition_penalty=1.2`.
- Hallucinated show names: on "any good shows to watch?" final invented "hiccup", "shittrits", "gincel" — these don't exist. Don't use it for actual recommendations. Pure style transfer, not knowledge.
- Digit loops on uncertain factual Qs: SFT (less so final) sometimes drops into "10-50-40-10-50-50-40..." repetition when asked for a factual answer it doesn't know. Same temperature / repetition-penalty fix.
- Pronoun drift in SFT: long SFT answers can confuse я ("I") and он ("he"). DPO mostly fixes this in the final checkpoint.
- Russian slang vocabulary lean: the Russian portion of the dataset was synthesized via Mistral-medium, which knows mainstream zoomer slang but may miss the freshest 2026 trends. Audit with the eval probes before betting your branding on it.
## Pipeline
```
urbandictionary EN ─┐
reddit ─────────────┼─ data/raw/ ─┐
wikipedia ──────────┘             │
                                  │ build_vocab + cpt corpus
LM Studio synth ──── RU vocab ────┤
(Mistral-24B,                     │ → 949 cpt rows
 Saiga-Nemo,                      │   553 vocab terms
 Qwen-14B,                        │
 Gemma-3-12B)                     │
                                  │
Mistral API + LM Studio ──────────┤ synthesize
(Mistral-medium, small)           │ → 180 SFT pairs
                                  │   180 DPO triples
                                  ▼
          ┌─ CPT on raw text (80 steps, 47 s)
          │
train ────┼─ SFT on Q→A pairs (200 steps, 117 s)
          │
          └─ DPO on chosen/rej (60 steps, 96 s)
               │
               LoRA r=16, α=32, target only
               language_model.*.{q,k,v,o}_proj
               (Gemma 4 is multimodal — vision/audio
               towers stay frozen)
               │
               ▼
          merge → bf16 GGUF (9.27 GB)
```
Why E2B not E4B: tried `google/gemma-4-E4B-it` and `unsloth/gemma-4-E4B-it-unsloth-bnb-4bit`; both fail to load on a 24 GB card with the default Windows pagefile (`OSError 1455: The paging file is too small` at `safe_open(file, device='cpu')` on a 9 GB safetensors). E2B fits comfortably (peak 21.5 GB VRAM during DPO).
## Dataset
| Source | Lang | Rows | Used in |
|---|---|---|---|
| urbandictionary EN (random API ×40 pages) | en | 400 | vocab anchors for synth |
| Reddit JSON (no auth) — r/{teenagers,copypasta,GenZ,okbuddyretard,2meirl4meirl,ihaveihave...} | en | 1200 | CPT corpus |
| Wikipedia "Gen Z slang" / "Internet slang" / "Молодёжный сленг" / "Поколение Z" | mix | 3 articles | CPT corpus |
| LM Studio synth (Mistral-24B / Saiga-Nemo / Qwen-14B / Gemma-3-12B, 6 prompts × 8 rounds, deduped) | ru | 153 | vocab anchors for synth |
| Mistral API + LM Studio fallback synth | ru/en | 80 + 100 | SFT (Q→zoomer-A) + DPO (chosen=zoomer, rejected=formal) |
## Hardware
- RTX 3090 24 GB
- Total training wall-clock: 47 s + 117 s + 96 s = 260 s (~4.3 min)
- Peak VRAM 21.5 GB (4bit nf4 base + fp32 LoRA + activations + grads)
- LoRA params: 5.36 M trainable / 3.94 B total
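The trainable-parameter figure is easy to sanity-check: LoRA factorizes each weight update as a low-rank product, adding r·(d_in + d_out) parameters per adapted matrix. A sketch of the arithmetic (the dimensions and layer count below are placeholders, not Gemma's actual shapes):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA learns B @ A with A: (r, d_in) and B: (d_out, r),
    # so one adapted matrix gains r * (d_in + d_out) trainable params.
    return r * (d_in + d_out)

# Placeholder dims for one square attention projection at r=16:
per_matrix = lora_params(2048, 2048, 16)
# Four targeted projections (q,k,v,o) per layer, times a placeholder layer count:
total = per_matrix * 4 * 30
print(per_matrix, total)
```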
## Use
### Adapter on top of base (transformers + peft)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "3ndetz/zoomerlm-gemma4-e2b", subfolder="adapter")
tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "как настроение?"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=120, do_sample=True,
                     temperature=0.85, top_p=0.95, repetition_penalty=1.1)
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### GGUF (LM Studio / ollama / llama.cpp) — recommended path
Download:
- `zoomerlm-gemma4-e2b-Q4_K_M.gguf` (3.2 GB) — text decoder, start here
- `mmproj-zoomerlm-gemma4-e2b-f16.gguf` (940 MB) — only if you want vision
Critical: disable Gemma 4 thinking when serving. Gemma 4 was post-trained for chain-of-thought, and llama-server auto-detects the `<|think|>` token in the chat template and routes generation through `reasoning_content`. This model was fine-tuned with `enable_thinking=False`, so it answers directly — but you have to tell the server not to wait for thinking.
Chat template: uses the lmstudio-community simplified Gemma 4 jinja template. Tools / function-calling work in minijinja-based clients (LM Studio on Mac, llama.cpp `--jinja`, etc.) — tested with a `get_weather(city: str)` style OpenAI-format tool; it returns a proper `tool_calls` structure.
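For reference, the tool-calling test above assumes the standard OpenAI tools shape; a sketch of what the `get_weather(city: str)` definition looks like (the description strings are illustrative):

```python
import json

# OpenAI-format tool definition matching the get_weather(city: str) test.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Berlin"}
            },
            "required": ["city"],
        },
    },
}]
# Pass as {"messages": [...], "tools": tools} to /v1/chat/completions;
# a successful call comes back in choices[0].message.tool_calls.
print(json.dumps(tools[0]["function"]["name"]))
```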
### llama-server (✓ tested, works)
```shell
./llama-server -m zoomerlm-gemma4-e2b-Q4_K_M.gguf \
    --mmproj mmproj-zoomerlm-gemma4-e2b-f16.gguf \
    -ngl 99 -c 4096 --port 8080 \
    --reasoning-budget 0
```
`--reasoning-budget 0` is the magic flag — without it, ~50% of replies come back empty (output gets buffered into thinking and never flushed to `content`).

Then hit `/v1/chat/completions` normally — the GGUF metadata bakes in the recommended sampler (`temp=0.7, top_p=0.9, top_k=0`).
### llama-cli text-only
```shell
./llama-cli -m zoomerlm-gemma4-e2b-Q4_K_M.gguf \
    -p "как настроение?" -n 200 \
    --temp 0.7 --top-p 0.9 --min-p 0.05 --repeat-penalty 1.0
```
### LM Studio
Drop both `.gguf` files into the same folder — LM Studio auto-detects the mmproj sidecar. In settings:
- the sampler should already be loaded from GGUF metadata (`temp=0.7`, `top_p=0.9`)
- set `min_p=0.05` manually (not yet baked into GGUF, see below)
- set Reasoning / Thinking → OFF (or "budget 0")
## Quantization
Q4_K_M is importance-aware quantized with the calibration imatrix from unsloth/gemma-4-E2B-it-GGUF (141 chunks, 275 entries, ~2.8 MB). The imatrix is reused from base because the LoRA delta is small relative to the base weights and the tensor names are identical — standard practice when quantizing a fine-tune.
### What's baked into the GGUF
| key | value |
|---|---|
| `general.sampling.temp` | 0.7 |
| `general.sampling.top_p` | 0.9 |
| `general.sampling.top_k` | 0 |
| `quantize.imatrix.dataset` | unsloth gemma-4 calibration |
| `tokenizer.chat_template` | patched: thinking only fires when the caller passes `enable_thinking=True` (Google's default also fired on any system message) |
| `gemma4.context_length` | 131072 |

Set these yourself in the client (most clients don't read them from GGUF):
- `min_p = 0.05`
- `repeat_penalty = 1.0`
If 9.27 GB bf16 is too big, you can quantize locally:
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release
./build/bin/llama-quantize ./zoomerlm-gemma4-e2b-bf16.gguf \
    ./zoomerlm-gemma4-e2b-q4_k_m.gguf Q4_K_M
```
## Limitations
- Not for production. This is a style-transfer experiment.
- Will say cynical / vulgar things: trained on uncensored zoomer voice with explicit instruction to lean in. Don't deploy customer-facing.
- Surface knowledge only: E2B effective ~2 B params. Generation is about vibe, not facts. Hallucinates show names, statistics, events.
- Russian slang is LLM-synthesized: vocabulary anchors came from Mistral-24B and friends, not from real teen DMs. Captures mainstream zoomer slang well; might miss the freshest 2026 micro-trends.
- Digit / emoji loops on out-of-distribution prompts. See "Failure modes" above for the workaround.
## License
Inherits Gemma license. Use responsibly.
## Acknowledgements
- Base model: Google Gemma 4 team
- Synth: Mistral AI (cloud API), local LM Studio (Mistral-Small-24B, Saiga-Nemo-12B, Qwen3-14B, Gemma-3-12B-it-qat)
- Corpus: Reddit JSON, urbandictionary, Wikipedia
- Tooling: HuggingFace `transformers` + `peft`, llama.cpp for GGUF, bitsandbytes for 4-bit nf4 quantization