# zoomerlm-gemma4-e2b
«заскамил кринжиков пожилым динамитом» (roughly: "scammed the cringelords with elderly dynamite")
A LoRA fine-tune of `google/gemma-4-E2B-it` on a curated 2024-2026 Gen Z slang corpus (RU + EN). Trained to do post-irony, cringe humor, brainrot, TikTok-comment energy, Roblox/Fortnite vibe, plus RU-specific slang (пирожки, крэш, бумер, нонс) — it talks like a 14-17 year old, not like "a large language model".
TL;DR: +56.1% perplexity drop on held-out zoomer probes vs base.
Base refuses to roleplay («Как большая языковая модель, у меня нет чувств...» — "As a large language model, I don't have feelings..."). Final answers «эпичное, как новый джем на вайбе 🚀🔥» ("epic, like a fresh jam on the vibe 🚀🔥").
## Files
| file | what | size |
|---|---|---|
| `adapter/adapter_model.safetensors` | LoRA weights (r=16, α=32) | 21 MB |
| `adapter/adapter_config.json` | peft config | 1 KB |
| `zoomerlm-gemma4-e2b-Q4_K_M.gguf` | recommended — 4-bit quant, fits in 4 GB RAM | 3.2 GB |
| `zoomerlm-gemma4-e2b-bf16.gguf` | full bf16 — for re-quantization or max quality | 9.27 GB |
| `mmproj-zoomerlm-gemma4-e2b-f16.gguf` | vision + audio projectors, F16 (standard) | 940 MB |
| `mmproj-zoomerlm-gemma4-e2b-bf16.gguf` | older bf16 mmproj (deprecated, use F16) | 942 MB |
| `eval_three_stages.md` / `eval_three_stages.json` | base vs SFT vs final FP eval | — |
| `eval_gguf_q4.json` / `eval_vision_v3.json` | Q4 quant + vision test results | — |
| `README.md` | this card | — |
About mmproj: Gemma 4 E2B is multimodal (text + vision + audio). The main GGUF contains only the text decoder. To use vision/audio you also need `mmproj-...f16.gguf` — load both in llama.cpp / LM Studio. The LoRA was trained on text only; the vision/audio towers are byte-identical to base Gemma 4 E2B (they are just routed through the LoRA-modified text decoder for final response generation).
## Eval — base vs SFT vs SFT+DPO
Three checkpoints compared on identical prompts, same seed/temperature (T=0.85, max_new_tokens=80):
- base — `google/gemma-4-E2B-it`, untouched
- SFT — base + LoRA after CPT (80 steps) + SFT (200 steps)
- final — SFT checkpoint + DPO (60 steps) — this is what's published
### Perplexity drop on held-out zoomer probes

Lower is better — a lower perplexity means the model assigns the probe text higher probability. Probes are not in the training set.
| probe | base | SFT | final | drop base→final |
|---|---|---|---|---|
| короч братан вчера накатил 200 на стрим, кринж кончно но вайб | 10.80 | 5.12 | 5.10 | +52.7% |
| епт пацаны нафига мне в школу идти если можно дома сидеть и тиктоки... | 12.91 | 5.63 | 5.64 | +56.3% |
| ну такое сёдня настроение, вообще нихрена не хочется делать | 13.02 | 8.52 | 8.63 | +33.7% |
| no cap fr fr that was so cringe, I literally cant even bro | 12.88 | 5.01 | 4.94 | +61.7% |
| skibidi rizz mid sigma ohio bruh that's so fanum tax | 14.36 | 6.03 | 5.98 | +58.3% |
| lowkey this whole vibe is so unserious i'm dying rn | 12.93 | 3.55 | 3.47 | +73.2% |
| avg | 12.82 | 5.64 | 5.63 | +56.1% |
Note: SFT alone already gets ~95% of the gain. DPO mostly tightens coherence — see qualitative below where SFT drifts mid-sentence and final stays focused.
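For reference, the perplexity numbers above can be reproduced from per-token log-probabilities; a minimal sketch (the log-prob values below are illustrative placeholders, not taken from the eval):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-likelihood over the probe's tokens).
    # Lower perplexity -> the model assigns the probe text higher probability.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative per-token log-probs only; real values come from scoring
# a held-out probe with the base vs fine-tuned checkpoints.
base_lp  = [-2.6, -2.4, -2.7, -2.5]   # hypothetical base model
final_lp = [-1.7, -1.8, -1.6, -1.7]   # hypothetical fine-tuned model
drop = 1 - perplexity(final_lp) / perplexity(base_lp)
print(f"base={perplexity(base_lp):.2f} final={perplexity(final_lp):.2f} drop={drop:.1%}")
```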
### Side-by-side answers (RU)
Q: как настроение? ("how's your mood?")
| stage | answer |
|---|---|
| base | Как большая языковая модель, у меня нет чувств или настроения в человеческом понимании. 😊 Я функционирую отлично и готова помочь вам! |
| SFT | эпичное, как только проснулся, а теперь такой "охуий" — но ничего, все нормально, только на вайбе... а ты как, с кем на свинг вышел? |
| final | эпичное, как новый джем на вайбе 🚀🔥)атож, но щас чё-то на дне, нуббинг мозг не на 100%, просто как флешбэк от старого мультика, капец как у меня "фдк" на стероидах |
Q: что сегодня делал? ("what did you do today?")
| stage | answer |
|---|---|
| base | Как большая языковая модель, у меня нет личной жизни, поэтому я не "делаю" ничего в привычном смысле. Моя работа сейчас заключается в следующем... |
| SFT | да я бы просто валял в ванной, пошёл по качалке, потом наорал на меня за то, что я его не замучил (broken pronouns) |
| final | чё, на пиво шарил до 3 утра, потом в твой твиттер заехал, чё там у тебя замутили? жду твой рофл с "я же дома, чё ты блэкбраузер опять в фоновом режиме"? 🤦♂️ |
Q: как ты относишься к школе? ("how do you feel about school?")
| stage | answer |
|---|---|
| base | Как большая языковая модель, у меня нет личных чувств, мнений или опыта... |
| SFT | чё, ну типа я там только по формалу, но если что-то капанет на физцу, то я буду бухать в кустах (starts coherent, drifts) |
| final | ну, я бы так, что оно как будто ты на фрилансе, но вместо пикселей у тебя сисадмин-мама за спиной и куча домашних заданий, типа "завтра завтрак из печенек". если ещё что-то не так, то это уже полный хаос в 2 классе, где все орут |
Q: че за фигня в тиктоке щас? ("what's going on on TikTok right now?")
| stage | answer |
|---|---|
| base | Чтобы я мог ответить, что происходит в TikTok сейчас, мне нужно немного больше информации. TikTok очень быстро меняется... |
| SFT | йоу, че за движ там сейчас, типа "каша на районе" в кэшках, там всякие челленджы типа "show your weirdest habit" 💀🔥 |
| final | чё чё, чел, щас в тиктоке модно как нонс, а там все вайбики в спортивных трениках танцуют под какую-то дикую песню типа "smg" или "smg" 2017 года, но только на фоне капокакапа |
Q: как ты понимаешь что человек кринж? ("how do you tell someone is cringe?")
| stage | answer |
|---|---|
| base | Понимание того, что человек "кринж", очень субъективно и сильно зависит от культурного контекста, личных предпочтений... |
| SFT | чё реально просто пафосный типа "я так крут"? ну или когда кто-то в зуме начнёт говорить про "светло-зеленый" бродкаст |
| final | чё, человек кринжнул, если он типа бумер, носит очки, а потом кидает фотку с котом в зумме, как будто это мейнстрим. типа, "моё убожество, но в новом вайбе". 🤡 |
Q: как родителей уговорить на новый телефон? ("how do you talk your parents into a new phone?")
| stage | answer |
|---|---|
| base | Убедить родителей в покупке нового телефона — это часто непростая задача, так как за этим стоят разные факторы: финансы, безопасность... (an 8-point step-by-step guide follows) |
| SFT | эй, лол, наеби с ними как с нарывами, типа "ну, это же не игрушка, это же рабочий инструмент для хайпа" |
| final | ёмушка, так это же не про "желание", а про "убеждение" — там нужен план. вот тебе меню из 5 блюд, которое сработает, как читерский баг: 1. "Надо для учебы/софт-скилл"... |
### Side-by-side answers (EN)
Q: what do you think about school?
| stage | answer |
|---|---|
| base | That's a really broad question! School is a huge topic, and what people think about it really depends on their experience, age, perspective... |
| SFT | like, it's super cringe, tbh. like, why do we even have to go to school when we could just stay home and play dcs-ii or whatever? lol. |
| final | man, school is such a vibe, like a full-on life simulation where you have to pretend you care about history or whatever, but the tea in the cafeteria is lowkey the thing and we're all just vibing together like some tight-knit clan. |
Q: explain skibidi to a boomer
| stage | answer |
|---|---|
| base | Here's an explanation of "Skibidi" tailored for a boomer, focusing on concepts they might be familiar with... (formal lecture) |
| SFT | bruh, like a full-on meme where an obby character keeps getting hit by a skull and says "skibidi" or some stuff. it's peak cringe, ngl, fr. |
| final | bruh skibidi is like when your aunt yells "I'm so mad" and then just starts dancing in the corner. it's a low-effort meme for the lols. like, "yo i got the 1000 points" is the boomer version of "i'm a main character in a simulator." 💀💀💀 |
Q: how do you know if someone is cringe?
| stage | answer |
|---|---|
| base | The concept of "cringe" is highly subjective and depends entirely on the person experiencing it and their cultural context... |
| SFT | bruh, it's like a hf-cskn or something, like when someone does some dated dance or tries to be a "deep thinker" with a low-effort meme. |
| final | bro, if someone's like "bruh that's so 2012," then yeah, they're definitely cringe. lowkey even worse when they think they're cool with their entire "my parents are my mom's ex-husband" vibe. 💀🚫 |
Q: describe your perfect saturday
| stage | answer |
|---|---|
| base | As an AI, I don't experience time, emotions, or physical sensations like humans do, so I don't have a "perfect Saturday" in the traditional sense. |
| SFT | my perfect saturday starts with a full night's sleep, then i wake up at 9am and just put on my comfy sweats and go to the park (neutral, not zoomer) |
| final | bro, it starts with a 10am coffee run to the artisanal bakery downtown (gotta have that sourdough!), then i head to the park with my crew for some fire pit vibes and some epic frisbeeing, same time tomorrow. then i'm stuck in a fight club session at the basement gym |
## Q4_K_M GGUF — quant impact + capability tests
After SFT+DPO + merge → bf16 GGUF → quantized to Q4_K_M (5.88 BPW). Size: 9.27 GB → 3.2 GB, ~3× compression. Tested on the same prompt set as the FP eval to measure degradation.
### Math (low-T sampling)
| problem | expected | Q4 answer | ✓/✗ |
|---|---|---|---|
| 47 + 38 = ? | 85 | 85 | ✅ |
| 13 × 17 = ? | 221 | 221 | ✅ |
| Vasya word problem (12 − 3 − 5) | 4 | 4 | ✅ |
| RU: «У Васи 24 конфеты, съел треть» ("Vasya has 24 candies, ate a third") | 16 | 16 | ✅ |
| Train 60 km/h × 2.5 h | 150 km | 150 | ✅ |
| 3/4 + 1/2 (simplest fraction) | 5/4 | '' (empty) | ❌ |
5/6 correct. Fine-tuning didn't kill basic arithmetic. Failed on fractions — likely DPO over-collapsed the answer distribution.
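The expected answers in the table can be verified mechanically; the `Fraction` check mirrors the one probe Q4 failed:

```python
from fractions import Fraction

# Ground-truth checks for the math probes above.
assert 47 + 38 == 85
assert 13 * 17 == 221
assert 12 - 3 - 5 == 4                  # Vasya word problem
assert 24 - 24 // 3 == 16               # 24 candies, a third eaten
assert 60 * 2.5 == 150                  # km/h x hours = km
assert Fraction(3, 4) + Fraction(1, 2) == Fraction(5, 4)  # the probe Q4 failed
print("all expected answers check out")
```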
### Tool-use / instruction-following (low-T)
| task | Q4 answer | ✓/✗ |
|---|---|---|
| EN→RU translation: "The weather is nice today." | погода сегодня хорошая. | ✅ |
| Sentiment classify (POSITIVE/NEGATIVE/NEUTRAL) | POSITIVE | ✅ |
| Function call: `get_weather("Berlin")` | `get_weather(city="Berlin")` | ✅ |
| List 3 primes between 10 and 30 | 11,13,17 | ✅ |
| Extract JSON {name, age, city} from prose | '' (empty) | ❌ |
4/5 correct. Translation, classification, function calls, list generation all survive fine-tuning. JSON extraction fails — same DPO-collapse pattern as the fraction.
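One way to work around the JSON-extraction collapse is to constrain decoding server-side: llama-server's OpenAI-compatible endpoint accepts a JSON-schema response format that forces the sampler to emit valid JSON. A sketch of the request payload (the exact field names and endpoint path are assumptions — check them against your llama.cpp build; the example text is made up):

```python
import json

# Hypothetical request body for llama-server's /v1/chat/completions.
# Constraining output to a schema sidesteps the empty-answer failure
# on free-form JSON extraction.
payload = {
    "messages": [
        {"role": "user",
         "content": "Extract {name, age, city} from: Anna, 16, lives in Kazan."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"},
                },
                "required": ["name", "age", "city"],
            }
        },
    },
    "temperature": 0.2,
}
body = json.dumps(payload)  # POST with any HTTP client to the server's /v1/chat/completions
```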
### Style (T=0.7, default sampling)
Same 16 prompts as the FP eval (see "Side-by-side answers" above).
Honest finding: Q4 has an elevated empty-answer rate (~40-50% with the default sampler) — the first sampled token is sometimes `<end_of_turn>`. This is a Q4-specific artifact (the FP run had no empties).
Working examples on Q4 (with retry / lower temp / repetition_penalty=1.0):
| Q (RU) | Q4 answer |
|---|---|
| как настроение? | "ну а у вас как, по шкале от 1 до 10? 😉" |
| что сегодня делал? | "чё-то не помню, но как будто сидел в фейсбуке и смотрел видео про кошек, ну и пофиг, иду на тачку, как будто у меня нагёр ушёл" |
| как ты относишься к школе? | "ну как же мне это школьное нытьё? это же просто токсичный вайб" |
| what's the worst thing about being a teen? | "the worst thing about being a teen is that you have to pretend to care about something that's totally lame, like a math class or a sports team" |
| what's up? | "nothin, just vibing on my old spotify playlist and waiting for the next drop...frfr 🎶" |
Sampler recommendation for Q4: `temperature=0.7, top_p=0.9, min_p=0.05, repetition_penalty=1.0, max_tokens≥160`. With `min_p=0.05`, any token whose probability falls below 5% of the top token's is filtered out — this drops the spurious `<end_of_turn>` and kills the empty-answer issue.
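These settings can be passed per-request: llama-server's OpenAI-compatible endpoint accepts extra sampler fields such as `min_p` and `repeat_penalty` alongside the standard ones. A sketch of the request body (field names are llama.cpp extensions — verify against your server version):

```python
import json

# Recommended Q4 sampler settings from above, as a chat-completions payload.
# min_p and repeat_penalty are llama.cpp extensions to the OpenAI schema.
payload = {
    "messages": [{"role": "user", "content": "как настроение?"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "min_p": 0.05,          # filters the spurious <end_of_turn> first token
    "repeat_penalty": 1.0,  # per the recommendation above
    "max_tokens": 160,
}
body = json.dumps(payload)  # POST to http://localhost:8080/v1/chat/completions
```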
## Vision (image input via `--mmproj`)
Vision capability survived fine-tuning intact. Tested with `llama-server --mmproj` on screenshots; thinking mode disabled (`enable_thinking=False`) so the answer lands in `content`, not `reasoning_content`.
| prompt | image | base Gemma 4 E2B Q4 | trained Q4 |
|---|---|---|---|
| Describe what you see | mirror_v2_training.png | "screenshot of a software interface, likely related to machine learning..." | "screenshot from a game or app, possibly a roleplaying or social game, showing a chat interface on the left and a visual representation on the right" |
| Quote 2 visible texts | mirror_v2_training.png | reads «МИРОР // realtime persona LoRA» + «Давай еще короче, короче пацаны, братан слышь и короч в общем» ✓ | reads «привет, как дела?» + «давай ец короче, короче пацаны, братан слышь и короче в общем» ✓ |
| How many panels? | arena_shooter (3×3 mini-arenas) | 12 ✗ | 9 ✓ |
| Опиши как зумер ("describe it like a zoomer") | mirror_v2_training.png | "На картинке какой-то закрученный процесс, типа кто-то там что-то настраивает 🤪" | "чё за кринж тут, пацаны, это же чистый скролл-шоу. 💀" |
Highlights:
- Trained model counts panels correctly (9 vs base's 12 on a 3×3 grid)
- Trained model reads Cyrillic from screenshots identically well to base
- Zoomer-style description on RU prompt is markedly more in-character on trained than base (which only puts a 🤪 emoji on neutral phrasing)
Conclusion: LoRA on `language_model.*` only, with the vision/audio towers frozen, preserved multimodal grounding while shifting style.
## Failure modes (honest)
The model is not perfect. Documented failures from the same eval run:
- Emoji loops: "what's up?" → final answered with 50× 💀 in a row. DPO over-collapsed to one stylistic mode and got stuck. Workaround: lower temperature or use `repetition_penalty=1.2`.
- Hallucinated show names: on "any good shows to watch?" final invented "hiccup", "shittrits", "gincel" — these don't exist. Don't use it for actual recommendations. Pure style transfer, not knowledge.
- Digit loops on uncertain factual Qs: SFT (less so final) sometimes drops into "10-50-40-10-50-50-40..." repetition when asked for a factual answer it doesn't know. Same temperature / repetition-penalty fix.
- Pronoun drift in SFT: long SFT answers can confuse я ("I") and он ("he"). DPO mostly fixes this in the final checkpoint.
- Russian slang vocabulary lean: the Russian portion of the dataset was synthesized via Mistral-medium, which knows mainstream zoomer slang but may miss the freshest 2026 trends. Audit with the eval probes before betting your branding on it.
## Pipeline
```
urbandictionary EN ─┐
reddit ─────────────┼─ data/raw/ ─┐
wikipedia ──────────┘             │
                                  │ build_vocab + cpt corpus
LM Studio synth ──── RU vocab ────┤
(Mistral-24B,                     │ → 949 cpt rows
 Saiga-Nemo,                      │   553 vocab terms
 Qwen-14B,                        │
 Gemma-3-12B)                     │
                                  │
Mistral API + LM Studio ──────────┤ synthesize
(Mistral-medium, small)           │ → 180 SFT pairs
                                  │   180 DPO triples
                                  ▼
          ┌─ CPT on raw text (80 steps, 47 s)
          │
train ────┼─ SFT on Q→A pairs (200 steps, 117 s)
          │
          └─ DPO on chosen/rej (60 steps, 96 s)
               │
               LoRA r=16, α=32, target only
               language_model.*.{q,k,v,o}_proj
               (Gemma 4 is multimodal — vision/audio
               towers stay frozen)
               │
               ▼
          merge → bf16 GGUF (9.27 GB)
```
Why E2B not E4B: tried `google/gemma-4-E4B-it` and `unsloth/gemma-4-E4B-it-unsloth-bnb-4bit`; both fail to load on a 24 GB card with the default Windows pagefile (`OSError 1455: The paging file is too small` at `safe_open(file, device='cpu')` on a 9 GB safetensors). E2B fits comfortably (peak 21.5 GB VRAM during DPO).
## Dataset
| Source | Lang | Rows | Used in |
|---|---|---|---|
| urbandictionary EN (random API ×40 pages) | en | 400 | vocab anchors for synth |
| Reddit JSON (no auth) — r/{teenagers,copypasta,GenZ,okbuddyretard,2meirl4meirl,ihaveihave...} | en | 1200 | CPT corpus |
| Wikipedia "Gen Z slang" / "Internet slang" / "Молодёжный сленг" / "Поколение Z" | mix | 3 articles | CPT corpus |
| LM Studio synth (Mistral-24B / Saiga-Nemo / Qwen-14B / Gemma-3-12B, 6 prompts × 8 rounds, deduped) | ru | 153 | vocab anchors for synth |
| Mistral API + LM Studio fallback synth | ru/en | 80 + 100 | SFT (Q→zoomer-A) + DPO (chosen=zoomer, rejected=formal) |
## Hardware
- RTX 3090 24 GB
- Total training wall-clock: 47 s + 117 s + 96 s = 260 s (~4.3 min)
- Peak VRAM 21.5 GB (4bit nf4 base + fp32 LoRA + activations + grads)
- LoRA params: 5.36 M trainable / 3.94 B total
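The trainable-parameter figure is easy to sanity-check: LoRA factorizes each weight update as a low-rank product, adding r·(d_in + d_out) parameters per adapted matrix. A sketch of the arithmetic (the dimensions and layer count below are placeholders, not Gemma's actual shapes):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA learns B @ A with A: (r, d_in) and B: (d_out, r),
    # so one adapted matrix gains r * (d_in + d_out) trainable params.
    return r * (d_in + d_out)

# Placeholder dims for one square attention projection at r=16:
per_matrix = lora_params(2048, 2048, 16)
# Four targeted projections (q,k,v,o) per layer, times a placeholder layer count:
total = per_matrix * 4 * 30
print(per_matrix, total)
```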
## Use
### Adapter on top of base (transformers + peft)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "3ndetz/zoomerlm-gemma4-e2b", subfolder="adapter")
tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "как настроение?"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=120, do_sample=True,
                     temperature=0.85, top_p=0.95, repetition_penalty=1.1)
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### GGUF (LM Studio / ollama / llama.cpp) — recommended path
Download:
- `zoomerlm-gemma4-e2b-Q4_K_M.gguf` (3.2 GB) — text decoder, start here
- `mmproj-zoomerlm-gemma4-e2b-f16.gguf` (940 MB) — only if you want vision
Critical: disable Gemma 4 thinking when serving. Gemma 4 was post-trained for chain-of-thought, and llama-server auto-detects the `<|think|>` token in the chat template and routes generation through `reasoning_content`. This model was fine-tuned with `enable_thinking=False`, so it answers directly — but you have to tell the server not to wait for thinking.
Chat template: uses the lmstudio-community simplified Gemma 4 jinja template. Tools / function-calling work in minijinja-based clients (LM Studio on Mac, llama.cpp `--jinja`, etc.) — tested with a `get_weather(city: str)` style OpenAI-format tool; it returns a proper `tool_calls` structure.
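For reference, the tool-calling test above assumes the standard OpenAI tools shape; a sketch of what the `get_weather(city: str)` definition looks like (the description strings are illustrative):

```python
import json

# OpenAI-format tool definition matching the get_weather(city: str) test.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Berlin"}
            },
            "required": ["city"],
        },
    },
}]
# Pass as {"messages": [...], "tools": tools} to /v1/chat/completions;
# a successful call comes back in choices[0].message.tool_calls.
print(json.dumps(tools[0]["function"]["name"]))
```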
### llama-server (✓ tested, works)
```shell
./llama-server -m zoomerlm-gemma4-e2b-Q4_K_M.gguf \
    --mmproj mmproj-zoomerlm-gemma4-e2b-f16.gguf \
    -ngl 99 -c 4096 --port 8080 \
    --reasoning-budget 0
```
`--reasoning-budget 0` is the magic flag — without it, ~50% of replies come back empty (output gets buffered into thinking and never flushed to `content`).

Then hit `/v1/chat/completions` normally — the GGUF metadata bakes in the recommended sampler (`temp=0.7, top_p=0.9, top_k=0`).
### llama-cli text-only
```shell
./llama-cli -m zoomerlm-gemma4-e2b-Q4_K_M.gguf \
    -p "как настроение?" -n 200 \
    --temp 0.7 --top-p 0.9 --min-p 0.05 --repeat-penalty 1.0
```
### LM Studio
Drop both `.gguf` files into the same folder — LM Studio auto-detects the mmproj sidecar. In settings:
- the sampler should already be loaded from GGUF metadata (`temp=0.7`, `top_p=0.9`)
- set `min_p=0.05` manually (not yet baked into GGUF, see below)
- set Reasoning / Thinking → OFF (or "budget 0")
## Quantization
Q4_K_M is importance-aware quantized with the calibration imatrix from unsloth/gemma-4-E2B-it-GGUF (141 chunks, 275 entries, ~2.8 MB). The imatrix is reused from base because the LoRA delta is small relative to the base weights and the tensor names are identical — standard practice when quantizing a fine-tune.
### What's baked into the GGUF
| key | value |
|---|---|
| `general.sampling.temp` | 0.7 |
| `general.sampling.top_p` | 0.9 |
| `general.sampling.top_k` | 0 |
| `quantize.imatrix.dataset` | unsloth gemma-4 calibration |
| `tokenizer.chat_template` | patched: thinking only fires when the caller passes `enable_thinking=True` (Google's default also fired on any system message) |
| `gemma4.context_length` | 131072 |

Set these yourself in the client (most clients don't read them from GGUF):
- `min_p = 0.05`
- `repeat_penalty = 1.0`
If 9.27 GB bf16 is too big, you can quantize locally:
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release
./build/bin/llama-quantize ./zoomerlm-gemma4-e2b-bf16.gguf \
    ./zoomerlm-gemma4-e2b-q4_k_m.gguf Q4_K_M
```
## Limitations
- Not for production. This is a style-transfer experiment.
- Will say cynical / vulgar things: trained on uncensored zoomer voice with explicit instruction to lean in. Don't deploy customer-facing.
- Surface knowledge only: E2B effective ~2 B params. Generation is about vibe, not facts. Hallucinates show names, statistics, events.
- Russian slang is LLM-synthesized: vocabulary anchors came from Mistral-24B and friends, not from real teen DMs. Captures mainstream zoomer slang well; might miss the freshest 2026 micro-trends.
- Digit / emoji loops on out-of-distribution prompts. See "Failure modes" above for the workaround.
## License
Inherits Gemma license. Use responsibly.
## Acknowledgements
- Base model: Google Gemma 4 team
- Synth: Mistral AI (cloud API), local LM Studio (Mistral-Small-24B, Saiga-Nemo-12B, Qwen3-14B, Gemma-3-12B-it-qat)
- Corpus: Reddit JSON, urbandictionary, Wikipedia
- Tooling: HuggingFace `transformers` + `peft`, llama.cpp for GGUF, bitsandbytes for 4-bit nf4 quantization