Clariso · Gemma 4 E4B · v9 mixed LoRA adapter

Rank-8 LoRA adapter that re-styles google/gemma-4-E4B-it answers into the plain-language convention. Intended audience: adults with cognitive impairments, second-language readers, and anyone who benefits from short sentences, common words, and a lede-first structure.

The adapter is designed to be loaded with a gated generation strategy (base_thought_lora_answer): keep the LoRA off during the base model's <|channel>thought ... <channel|> reasoning span, then flip it on for the answer. This preserves Gemma 4's full reasoning capability and applies the plain-language compression only to the final output.

Demo

Try it in a browser: Clariso Space

Quick usage

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer
from peft import PeftModel

base_id = "google/gemma-4-E4B-it"
adapter_id = "kameronk/clariso-gemma4-e4b-v9-peft"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForImageTextToText.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
model = PeftModel.from_pretrained(base, adapter_id)
model.train(False)

# The trained system prompt (keep verbatim — the LoRA was conditioned on this).
system = (
    "You write accessible answers for adults with cognitive impairments. Rules:\n"
    "- One idea per sentence.\n"
    "- Short sentences.\n"
    "- Simple, common words.\n"
    "- Put the main answer first.\n"
    "- Use bullet points for lists.\n"
    "- Keep the answer brief.\n"
    "- Be concrete.\n"
    "- Be reassuring without talking down to the reader."
)
user = "What should I do if my child has a fever?"

prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": system}, {"role": "user", "content": user}],
    tokenize=False, add_generation_prompt=True, enable_thinking=True,
)

# Gated generation: adapter OFF during thinking, ON for the answer.
# `<channel|>` is the marker that ends the reasoning span and begins the answer.
channel_id = tokenizer(["<channel|>"], add_special_tokens=False).input_ids[0][0]
end_id = tokenizer(["<turn|>"], add_special_tokens=False).input_ids[0][0]

model.disable_adapter_layers()
ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
past, cur, out_ids, flipped = None, ids, [], False
with torch.no_grad():
    for _ in range(400):
        out = model(input_ids=cur, past_key_values=past, use_cache=True)
        past = out.past_key_values
        logits = out.logits[:, -1, :] / 1.0  # temperature 1.0 (Gemma 4 default); top_p/top_k filtering omitted for brevity
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        nid = int(nxt.item())
        out_ids.append(nid)
        if nid == channel_id and not flipped:
            model.enable_adapter_layers()
            flipped = True
        if nid == end_id:
            break
        cur = nxt

print(tokenizer.decode(out_ids, skip_special_tokens=False))

Recommended sampling

Use the base Gemma 4 sampling spec from Google: temperature=1.0, top_p=0.95, top_k=64. Running this adapter at lower temperatures degrades reasoning quality (see scripts/run_easyread_v10_battery.py in the source repo for the calibration study).
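
The Quick usage loop samples at temperature 1.0 from the full distribution. The sketch below shows one way the top_k and top_p parts of the spec could be applied to a single step's logits. It is a dependency-free illustration on a plain list of floats, not the code used in the source repo; the function names filter_top_k_top_p and sample are hypothetical, and the order (temperature, then top-k, then top-p) is an assumption that mirrors common practice.

```python
import math
import random


def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k, then top-p.

    Hypothetical helper for illustration; `logits` is a list of floats.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    ranked = ranked[:top_k]

    # Numerically stable softmax over the surviving tokens.
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_top_k_top_p(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

In the generation loop, the same filtering would be applied to the last-position logits before the multinomial draw.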

Architecture

  • Type: LoRA (PEFT), rank 8, lora_alpha 160, dropout 0.0
  • Target modules: q_proj, o_proj, gate_proj, up_proj, down_proj, per_layer_input_gate, per_layer_projection
  • Layers targeted: top 16 (layers_to_transform: [26..41])
  • Base: google/gemma-4-E4B-it
  • Storage: 26 MB (adapter_model.safetensors)
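
For orientation, the settings above can be written out as a plain Python dict using PEFT's LoraConfig field names. This is a sketch reconstructed from the bullet list, not a copy of the shipped adapter_config.json, which may contain further fields. Under the standard LoRA convention the update is scaled by lora_alpha / r, which here gives 20 and appears to correspond to the "scale 20" listed in the training recipe.

```python
# Sketch of the adapter settings, keyed by PEFT LoraConfig field names.
# Assumption: values are taken from the Architecture list only.
adapter_config = {
    "peft_type": "LORA",
    "base_model_name_or_path": "google/gemma-4-E4B-it",
    "r": 8,
    "lora_alpha": 160,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "per_layer_input_gate",
        "per_layer_projection",
    ],
    # [26..41] inclusive: the top 16 layers.
    "layers_to_transform": list(range(26, 42)),
}

# Standard LoRA scaling factor applied to the low-rank update.
scaling = adapter_config["lora_alpha"] / adapter_config["r"]
```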

Training

  • Trainer: mlx-lm (Apple Silicon LoRA fine-tune); converted to PEFT format for cross-platform inference.
  • Hardware: M-series Mac.
  • Data: 622 mixed rows from easy_read_v6 — answer-only + channel-bound thinking variants. Self-bootstrapped from Gemma 4 31B as writer/critic/judge (no frontier-model dependence).
  • Recipe: lr 1e-5, scale 20, stop-weight 8, thought-loss 0.05.

Evaluation

Validated against a 20-question leakage-clean battery; the full report is in runs/2026-05-02_v10_bf16-mixed-report/REPORT.md of the source repo. Results under the recommended base_thought_lora_answer gate:

Metric                         vs. base          Direction
Length (chars)                 −1,214            ⬇ briefer
Length (tokens)                −265              ⬇ briefer
Flesch-Kincaid grade           −3.56             ⬇ easier
Dale-Chall difficulty          −2.26             ⬇ easier
Strict reasoning correctness   30/30             ✓ preserved
Empty-answer rate              0%                ✓ no regressions
ARC capability screen          no large delta    ✓ general capability intact
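
To make the readability rows concrete, here is a minimal sketch of the Flesch-Kincaid grade formula: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter is a crude vowel-group heuristic chosen for brevity, and the function names are hypothetical; the report's own tooling is not described here and may score text differently.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level using a heuristic syllable count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


plain = "Give your child water. Let them rest. Call a doctor if the fever stays."
dense = (
    "Administration of antipyretic medication is recommended whenever "
    "the child's temperature remains persistently elevated."
)
plain_grade = flesch_kincaid_grade(plain)
dense_grade = flesch_kincaid_grade(dense)
```

Short sentences and short words both pull the grade down, which is the direction the table reports for the adapter's answers.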

Limitations

  • English only. v9 mixed corpus is 100% English; multilingual coverage is reduced relative to the base model. A v9-multilingual corpus is queued.
  • The trained system prompt matters. The LoRA was conditioned on the P_FULL system prompt (the one shown under Quick usage) verbatim. Using a different system prompt at inference time weakens the adapter's effect. Either keep P_FULL or use the neutral "You are a helpful assistant." (also seen during training).
  • Calibrated for temp=1.0. Low temperatures (≤0.3) collapse reasoning onto a "plan-only" trajectory and produce wrong answers on arithmetic questions. Stay at the Gemma 4 official spec.
  • Single-flight per process when using PEFT's adapter toggle. The enable/disable API mutates the model object globally — concurrent generation across threads requires a lock.
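
The single-flight constraint can be handled with one process-wide lock around each gated generation. The sketch below assumes a hypothetical run_generation callable that performs the decode loop and flips the adapter on at the answer marker; only disable_adapter_layers is called here, matching the PEFT toggle used in Quick usage.

```python
import threading

# One lock per process: the adapter toggle mutates shared model state.
_adapter_lock = threading.Lock()


def generate_gated(model, run_generation):
    """Run one gated generation while holding the adapter lock.

    `run_generation` is a hypothetical callable that takes the model,
    runs the decode loop, and enables the adapter when the answer starts.
    """
    with _adapter_lock:
        # Start every request with the adapter off (thinking phase).
        model.disable_adapter_layers()
        try:
            return run_generation(model)
        finally:
            # Leave the model in a known state for the next caller.
            model.disable_adapter_layers()
```

A lock serializes requests, so throughput is one generation at a time per model object; running several processes, each with its own model copy, is one way around that.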

Intended use

  • Plain-language rewrites of factual or instructional content for cognitively diverse audiences.
  • Companion-style explanation of medical, legal, or technical material (paired with the gated-thinking strategy so the reasoning remains intact).

Out-of-scope use

  • Safety-critical clinical decisions without human review.
  • High-stakes legal or financial advice.
  • Domains requiring formal register or technical precision in the output — the adapter compresses by design.
