Gemma 4 E2B Uncensored-MAX — LiteRT-LM

An on-device .litertlm conversion of prithivMLmods/gemma-4-E2B-it-Uncensored-MAX for the LiteRT-LM runtime. It runs completely offline on Android devices, including Pixel phones and other Google Tensor / Mali-based hardware.

Spec	Value
Base	`prithivMLmods/gemma-4-E2B-it-Uncensored-MAX`
Format	`.litertlm` (LiteRT-LM bundle)
Quantization	INT4 (`dynamic_wi4_afp32`)
Context	32,768 tokens (bundle exported with `cache_length=32000`)
Size	2.37 GiB (2,550,041,824 bytes)
Effective params	2.3B (5.1B total with PLE)
Architecture	Dense, 35 layers, Per-Layer Embeddings (PLE)
Modalities	Text only (vision/audio weights stripped)
License	Apache 2.0

What This Model Is

This is a community fine-tune of Google's Gemma 4 E2B instruction-tuned model (2.3B effective parameters), packaged for on-device inference with the LiteRT-LM engine. The fine-tune, prithivMLmods/gemma-4-E2B-it-Uncensored-MAX, was trained further on additional datasets and then converted to Google's .litertlm format so it can run natively on mobile runtimes.

Key characteristics:

Text-only — image and audio understanding weights were removed to keep the bundle small (2.37 GB vs ~5 GB multimodal)
32K context window — suitable for long documents, multi-turn chat, and agentic workflows
INT4 quantized — compression for mobile storage with minimal quality impact
On-device — runs completely offline on Android / Pixel / Tensor / Mali devices

⚠️ Standard disclaimer: Like all open-weight language models, outputs may contain inaccuracies, biases, or content you find objectionable. This is an unofficial community build; outputs may differ from Google's official checkpoint due to fine-tuning and INT4 re-quantization. Evaluate outputs critically and do not rely on this model for high-stakes decisions without human review.

How to Use

Google AI Edge Gallery (Pixel / Android)

Install Google AI Edge Gallery or a compatible fork
Add the model using its Hugging Face repo ID: PeppX/gemma-4-e2b-uncensored-litertlm
Download the 2.37 GB bundle over Wi-Fi
Chat offline — no data leaves your device

LiteRT-LM CLI (macOS / Linux / Desktop)

pip install litert-lm
litert-lm run --from-huggingface-repo PeppX/gemma-4-e2b-uncensored-litertlm \
              gemma-4-E2B-it-Uncensored-MAX.litertlm \
              --prompt "Your prompt here" \
              --backend gpu

Direct Android Integration

If your app loads models from a model_allowlist.json, add an entry like this:

{
  "name": "Gemma-4-E2B-it",
  "modelId": "PeppX/gemma-4-e2b-uncensored-litertlm",
  "modelFile": "gemma-4-E2B-it-Uncensored-MAX.litertlm",
  "sizeInBytes": 2550041824,
  "commitHash": "e7e8f3d72ad3f667f313e4c7b118491d696d339f",
  "llmSupportImage": false,
  "llmSupportAudio": false,
  "llmSupportThinking": true,
  "defaultConfig": {
    "topK": 64, "topP": 0.95, "temperature": 1.0,
    "maxContextLength": 32000, "maxTokens": 4000,
    "accelerators": "gpu,npu,cpu"
  }
}
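
Before shipping an allowlist entry, it can help to confirm that the downloaded file matches `sizeInBytes`. The following is a minimal sketch; the helper names `bytes_to_gib` and `size_matches` are illustrative and are not part of any LiteRT-LM or AI Edge Gallery API.

```python
import os

EXPECTED_SIZE = 2550041824  # sizeInBytes from the allowlist entry above


def bytes_to_gib(n_bytes):
    """Convert a byte count to binary gigabytes (GiB, 2**30 bytes)."""
    return n_bytes / 2**30


def size_matches(path, expected=EXPECTED_SIZE):
    """Return True if the file at `path` has exactly `expected` bytes."""
    return os.path.getsize(path) == expected


print(f"{bytes_to_gib(EXPECTED_SIZE):.2f} GiB")  # prints "2.37 GiB"
```

The conversion also shows that the 2.37 figure quoted for the bundle corresponds to binary gigabytes; in decimal units the same file is about 2.55 GB.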

Technical Details for Developers & Agents

Export Pipeline

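from litert_torch.generative.export_hf import export as export_lib

# Placeholder paths (not from the original run); point these at your own directories.
MODEL_DIR = "path/to/text_only_checkpoint"
OUTPUT_DIR = "path/to/export_output"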

export_lib.export(
    model=MODEL_DIR,                    # Text-decoder weights only
    output_dir=OUTPUT_DIR,
    task="text_generation",
    bundle_litert_lm=True,
    quantization_recipe="dynamic_wi4_afp32",
    cache_length=32000,
    prefill_lengths=[256],
    use_jinja_template=True,
    externalize_embedder=True,
    single_token_embedder=True,
)

Weight Extraction

The source checkpoint was multimodal (text + image + audio). Only the model.language_model.* tensors in the safetensors files were retained, and the following vision/audio keys were stripped from the config:

vision_config, audio_config
image_token_id, audio_token_id, video_token_id, boi_token_id, eoi_token_id
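
The filtering described above can be sketched in plain Python. This is an illustration under assumptions, not the script used for this export: the tensor names and config values below are hypothetical, and a real run would read names from the safetensors files and the config from config.json.

```python
# Config keys listed above as stripped from the multimodal config.
STRIPPED_KEYS = {
    "vision_config", "audio_config",
    "image_token_id", "audio_token_id", "video_token_id",
    "boi_token_id", "eoi_token_id",
}
TEXT_PREFIX = "model.language_model."


def keep_text_tensors(tensor_names):
    """Keep only tensor names that live under the text decoder prefix."""
    return [name for name in tensor_names if name.startswith(TEXT_PREFIX)]


def strip_multimodal_config(config):
    """Return a copy of the config dict without the vision/audio keys."""
    return {k: v for k, v in config.items() if k not in STRIPPED_KEYS}


# Hypothetical tensor names and config values, for illustration only.
names = [
    "model.language_model.layers.0.self_attn.q_proj.weight",
    "model.vision_tower.encoder.layers.0.weight",
    "model.audio_tower.conv.weight",
]
config = {"model_type": "gemma4", "vision_config": {}, "image_token_id": 1}

print(keep_text_tensors(names))
print(strip_multimodal_config(config))
```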

Component Breakdown

Component	Original	Quantized
`model.tflite` (decoder)	8.50 GB	1.08 GB (INT4)
`embedder.tflite`	1.50 GB	195 MB (INT4)
`per_layer_embedder.tflite`	8.75 GB	1.10 GB (INT4)
Bundle total	—	2.37 GB
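
As a quick consistency check on the table, the quantized components can be summed and compared with their originals. This sketch assumes 195 MB is 0.195 GB; the source does not state the original dtype, so the 32-bit-float reading of the ratios is an assumption.

```python
# (original GB, quantized GB) per component, taken from the table above.
components = {
    "model.tflite": (8.50, 1.08),
    "embedder.tflite": (1.50, 0.195),  # 195 MB, assuming 1 GB = 1000 MB
    "per_layer_embedder.tflite": (8.75, 1.10),
}

total_gb = sum(quantized for _, quantized in components.values())
ratios = {name: orig / quant for name, (orig, quant) in components.items()}

print(f"quantized total: {total_gb:.3f} GB")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller")
```

The sum comes to roughly 2.375 GB, which agrees with the 2.37 GB bundle total within rounding. Each ratio is a little under 8x, which is what a 32-bit to 4-bit conversion plus some metadata overhead would look like.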

Environment

Host: Google Colab A100 80GB
torch: 2.12.0+cu130 (CUDA 13.0)
litert-torch: 0.9.1
Export time: ~12 minutes
Export path: public litert-torch (not Google's internal tools)

Known Limitations

PLE bloat: Google's stock E2B bundle is 2.41 GB and uses proprietary PLE compression. Community exports through the public litert-torch path produce larger PLE weights (~1.1 GB vs ~0.6 GB). This is expected, and the model still runs correctly.
No vision/audio: Stripped for size. If you need multimodal, use the original google/gemma-4-E2B-it or a full-weight community export.
INT4 quality: Slightly lower precision than INT8. Most users won't notice in conversational use, but long-reasoning chains may show minor degradation.
Chat template: Uses the source model's Jinja template. If the Conversation API fails, fall back to raw Session API inference.
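
The INT4 quality note above can be illustrated with a toy round trip. The sketch uses symmetric per-tensor quantization on a handful of made-up weights; this is an assumption chosen for clarity and is not the actual `dynamic_wi4_afp32` recipe, whose details are not described here.

```python
def quantize_roundtrip(values, bits):
    """Symmetric per-tensor quantization: map to signed ints, then back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    restored = []
    for v in values:
        q = max(-qmax - 1, min(qmax, round(v / scale)))  # clamp to the int range
        restored.append(q * scale)
    return restored


def max_abs_error(values, bits):
    """Largest absolute difference between the original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up weights, for illustration only.
weights = [-0.50, -0.21, -0.03, 0.0, 0.07, 0.18, 0.33, 0.49]
err_int4 = max_abs_error(weights, 4)
err_int8 = max_abs_error(weights, 8)
print(f"INT4 max error: {err_int4:.4f}")
print(f"INT8 max error: {err_int8:.4f}")
```

With only 16 levels, the INT4 step size is about 18 times coarser than INT8 for the same value range, which is why small per-weight errors can add up over long reasoning chains.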

Credits

Packaged by lethflow.com — an Ottawa-based AI solutions team that makes operational complexity disappear.

Named for the river Lethe in Greek mythology — the waters that erased what no longer served you. Lethflow does the same for manual workflows, replacing them with seamless AI-powered integrations, real-time analytics, automated compliance, smart scheduling, and intelligent customer communication.

We believe AI should work with your existing tools, not replace them. Every solution connects to the software your team already trusts — making it smarter, faster, and more efficient.

This model was exported, optimized, and packaged by the Lethflow team as part of our on-device AI R&D. If you're building agentic systems, custom Android AI apps, or need help bridging HuggingFace models to mobile runtimes — reach out.

Base Model & Attribution

Original checkpoint: google/gemma-4-E2B-it (Google DeepMind)
Community fine-tune: prithivMLmods/gemma-4-E2B-it-Uncensored-MAX
LiteRT-LM packaging: PeppX / lethflow.com
License: Apache 2.0 (inherited from Gemma)

This is an unofficial community build. It is not published by Google or the LiteRT team. Outputs may differ from the source checkpoint due to INT4 re-quantization during packaging.

Tensor G4 / Mali GPU Optimization (2025-05-13)

A metadata patch was applied to ensure the model is recognized as gemma4 by the LiteRT-LM runtime, enabling Gemma4DataProcessor and Multi-token Prediction (MTP) on Google Tensor G4 and compatible Mali GPUs.

Patch Details

Attribute	Value
Patch offset	16420 (within `LlmMetadataProto` section)
Original bytes	`0a 00` (`generic_model`, field 1)
Patched bytes	`42 00` (`gemma4`, field 8)
Runtime effect	Activates Gemma-4-specific data processor and MTP decode path
Expected speedup	~2x faster token generation on mobile GPU/NPU vs generic fallback
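
The byte values in the table follow from standard protobuf tag encoding, where a tag byte is `(field_number << 3) | wire_type`. The sketch below assumes both fields are length-delimited (wire type 2) with a zero-length payload, which is inferred from the bytes themselves; the `LlmMetadataProto` definition is not shown here.

```python
LENGTH_DELIMITED = 2  # protobuf wire type for embedded messages, strings, bytes


def protobuf_tag(field_number, wire_type):
    """Encode a protobuf field tag; single-byte form, valid for field numbers below 16."""
    return (field_number << 3) | wire_type


# Tag byte followed by a length byte of 0 (empty payload).
original = bytes([protobuf_tag(1, LENGTH_DELIMITED), 0])  # field 1
patched = bytes([protobuf_tag(8, LENGTH_DELIMITED), 0])   # field 8

print(original.hex(" "), "->", patched.hex(" "))  # prints "0a 00 -> 42 00"
```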

This is a runtime compatibility correction. Model weights, architecture, and inference behavior are unchanged. The patch only updates the llm_model_type enum in the .litertlm bundle header so that LiteRT-LM v0.11.0+ selects the optimized decode pipeline instead of the generic fallback.
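
To check whether a local copy of the bundle already carries the patch, the two bytes at the documented offset can be read directly. This is a minimal sketch: `patch_state` is a hypothetical helper, and the offset applies only to this specific bundle build, so other .litertlm files will place their metadata elsewhere.

```python
PATCH_OFFSET = 16420                     # from the table above
ORIGINAL_BYTES = bytes.fromhex("0a00")   # generic_model
PATCHED_BYTES = bytes.fromhex("4200")    # gemma4


def read_bytes_at(path, offset=PATCH_OFFSET, length=2):
    """Read `length` bytes starting at `offset` without loading the whole file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def patch_state(path):
    """Return 'patched', 'original', or 'unknown' for the bundle at `path`."""
    found = read_bytes_at(path)
    if found == PATCHED_BYTES:
        return "patched"
    if found == ORIGINAL_BYTES:
        return "original"
    return "unknown"
```

For example, `patch_state("gemma-4-E2B-it-Uncensored-MAX.litertlm")` should return `"patched"` for the bundle at the commit listed below. An `"unknown"` result most likely means the file is a different build and the offset does not apply.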

Commit Hash Update

The patched bundle corresponds to commit e7e8f3d72ad3f667f313e4c7b118491d696d339f. If you are referencing this model in a model_allowlist.json or similar registry, update the commitHash field to this value for integrity verification.
