Gemma 4 E2B Uncensored-MAX - LiteRT-LM

On-device .litertlm conversion of prithivMLmods/gemma-4-E2B-it-Uncensored-MAX for the LiteRT-LM runtime. Runs completely offline on Android, Pixel, and Google Tensor / Mali-based devices.

| Spec | Value |
| --- | --- |
| Base | prithivMLmods/gemma-4-E2B-it-Uncensored-MAX |
| Format | .litertlm (LiteRT-LM bundle) |
| Quantization | INT4 (dynamic_wi4_afp32) |
| Context | 32,768 tokens |
| Size | 2.37 GB |
| Effective params | 2.3B (5.1B total with PLE) |
| Architecture | Dense, 35 layers, Per-Layer Embeddings (PLE) |
| Modalities | Text only (vision/audio weights stripped) |
| License | Apache 2.0 |

What This Model Is

This is a community fine-tuned variant of Google's Gemma 4 E2B (2.3B effective parameter) instruction-tuned model, packaged for on-device inference via the LiteRT-LM engine. The base model was further tuned on additional datasets and then converted to Google's .litertlm format for native mobile runtime execution.

Key characteristics:

  • Text-only: image and audio understanding weights were removed to keep the bundle small (2.37 GB vs ~5 GB multimodal)
  • 32K context window (exported with cache_length=32000): suitable for long documents, multi-turn chat, and agentic workflows
  • INT4 quantized: 4-bit weights shrink the bundle for mobile storage, at a small quality cost (see Known Limitations)
  • On-device: runs completely offline on Android / Pixel / Tensor / Mali devices

⚠️ Standard disclaimer: Like all open-weight language models, outputs may contain inaccuracies, biases, or content you find objectionable. This is an unofficial community build; outputs may differ from Google's official checkpoint due to fine-tuning and INT4 re-quantization. Evaluate outputs critically and do not rely on this model for high-stakes decisions without human review.

How to Use

Google AI Edge Gallery (Pixel / Android)

  1. Install Google AI Edge Gallery or a compatible fork
  2. Add model via HuggingFace URL: PeppX/gemma-4-e2b-uncensored-litertlm
  3. Download the 2.37 GB bundle over WiFi
  4. Chat offline: no data leaves your device

LiteRT-LM CLI (macOS / Linux / Desktop)

pip install litert-lm
litert-lm run --from-huggingface-repo PeppX/gemma-4-e2b-uncensored-litertlm \
              gemma-4-E2B-it-Uncensored-MAX.litertlm \
              --prompt "Your prompt here" \
              --backend gpu

Direct Android Integration

If you're building an app with model_allowlist.json:

{
  "name": "Gemma-4-E2B-it",
  "modelId": "PeppX/gemma-4-e2b-uncensored-litertlm",
  "modelFile": "gemma-4-E2B-it-Uncensored-MAX.litertlm",
  "sizeInBytes": 2550041824,
  "commitHash": "e7e8f3d72ad3f667f313e4c7b118491d696d339f",
  "llmSupportImage": false,
  "llmSupportAudio": false,
  "llmSupportThinking": true,
  "defaultConfig": {
    "topK": 64, "topP": 0.95, "temperature": 1.0,
    "maxContextLength": 32000, "maxTokens": 4000,
    "accelerators": "gpu,npu,cpu"
  }
}
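As a quick sanity check on the entry above, the sketch below parses a trimmed copy of it and compares sizeInBytes against a downloaded file. The helper names are hypothetical, and how your app actually consumes model_allowlist.json may differ. Note that 2,550,041,824 bytes is about 2.37 GiB (2.55 GB in decimal units), so the "2.37 GB" figure quoted in this card appears to use binary units.

```python
import json
import os

# Trimmed copy of the allowlist entry shown above (only the fields used here).
ALLOWLIST_ENTRY = """
{
  "modelId": "PeppX/gemma-4-e2b-uncensored-litertlm",
  "modelFile": "gemma-4-E2B-it-Uncensored-MAX.litertlm",
  "sizeInBytes": 2550041824
}
"""


def expected_size_gib(entry: dict) -> float:
    """Convert the declared size from bytes to GiB (1024**3 bytes)."""
    return entry["sizeInBytes"] / (1024 ** 3)


def size_matches(entry: dict, path: str) -> bool:
    """Return True if the file on disk has exactly the declared size."""
    return os.path.getsize(path) == entry["sizeInBytes"]


entry = json.loads(ALLOWLIST_ENTRY)
print(f"{entry['modelFile']}: {expected_size_gib(entry):.2f} GiB expected")
```

After a download, calling size_matches(entry, local_path) is a cheap first check before handing the bundle to the runtime; it does not replace a hash comparison.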

Technical Details for Developers & Agents

Export Pipeline

from litert_torch.generative.export_hf import export as export_lib

export_lib.export(
    model=MODEL_DIR,                    # Text-decoder weights only
    output_dir=OUTPUT_DIR,
    task="text_generation",
    bundle_litert_lm=True,
    quantization_recipe="dynamic_wi4_afp32",
    cache_length=32000,
    prefill_lengths=[256],
    use_jinja_template=True,
    externalize_embedder=True,
    single_token_embedder=True,
)
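To give a feel for what a 4-bit weight format trades away, here is a minimal sketch of symmetric integer quantization in plain Python. It assumes the recipe name dynamic_wi4_afp32 means 4-bit integer weights with float32 activations; the per-tensor scale used here is a simplification chosen for illustration, and the real recipe may group weights differently.

```python
def quantize_int4(weights):
    """Symmetric quantization of floats to signed 4-bit integers in [-8, 7].

    Illustrative only: a single scale is shared by the whole list.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0  # map the largest magnitude to +/-7
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [q * scale for q in quantized]


weights = [0.9, -0.31, 0.0, 0.12, -0.88]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("worst absolute error:", worst_error)
```

With only 16 representable levels per scale, the rounding error is bounded by half a step (scale / 2), which is the kind of precision loss the INT4 limitation in this card refers to.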

Weight Extraction

The source checkpoint was multimodal (text + image + audio). Only the model.language_model.* safetensors were retained, and the following vision/audio config keys were stripped:

  • vision_config, audio_config
  • image_token_id, audio_token_id, video_token_id, boi_token_id, eoi_token_id
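The filtering described above can be sketched with plain dictionaries and lists. The prefix and the config keys come from this card; the tensor names, the config values, and the assumption that the stripped keys sit at the top level of the config are illustrative only, and this is not the script that was actually used.

```python
KEEP_PREFIX = "model.language_model."

STRIPPED_CONFIG_KEYS = {
    "vision_config", "audio_config",
    "image_token_id", "audio_token_id", "video_token_id",
    "boi_token_id", "eoi_token_id",
}


def keep_text_weights(tensor_names):
    """Keep only tensors that belong to the text decoder."""
    return [name for name in tensor_names if name.startswith(KEEP_PREFIX)]


def strip_multimodal_config(config):
    """Return a copy of the config without vision/audio keys."""
    return {k: v for k, v in config.items() if k not in STRIPPED_CONFIG_KEYS}


# Hypothetical tensor names and config values, for illustration only.
names = [
    "model.language_model.layers.0.self_attn.q_proj.weight",
    "model.vision_tower.encoder.layers.0.weight",
    "model.audio_tower.conv.weight",
]
config = {"model_type": "example", "vision_config": {}, "image_token_id": 1, "vocab_size": 10}

print(keep_text_weights(names))
print(strip_multimodal_config(config))
```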

Component Breakdown

| Component | Original | Quantized |
| --- | --- | --- |
| model.tflite (decoder) | 8.50 GB | 1.08 GB (INT4) |
| embedder.tflite | 1.50 GB | 195 MB (INT4) |
| per_layer_embedder.tflite | 8.75 GB | 1.10 GB (INT4) |
| Bundle total | n/a | 2.37 GB |
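The figures in the breakdown can be cross-checked with a few lines of arithmetic. The sketch below sums the components and computes each compression ratio, treating 195 MB as 0.195 GB. The ratios come out a little under 8x, which is what a 32-bit to 4-bit reduction plus some overhead would give; that the Original column is float32 is an assumption, not something the card states.

```python
# (original_gb, quantized_gb) per component, copied from the breakdown above.
COMPONENTS_GB = {
    "model.tflite": (8.50, 1.08),
    "embedder.tflite": (1.50, 0.195),  # 195 MB taken as 0.195 GB
    "per_layer_embedder.tflite": (8.75, 1.10),
}


def totals(components):
    """Sum the original and quantized sizes across all components."""
    original = sum(orig for orig, _ in components.values())
    quantized = sum(quant for _, quant in components.values())
    return original, quantized


def compression_ratios(components):
    """Original size divided by quantized size, per component."""
    return {name: orig / quant for name, (orig, quant) in components.items()}


original_total, quantized_total = totals(COMPONENTS_GB)
print(f"original: {original_total:.2f} GB, quantized: {quantized_total:.3f} GB")
for name, ratio in compression_ratios(COMPONENTS_GB).items():
    print(f"{name}: {ratio:.2f}x smaller")
```

The quantized parts add up to about 2.375 GB, which agrees with the 2.37 GB bundle total listed above.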

Environment

  • Host: Google Colab A100 80GB
  • torch: 2.12.0+cu130 (CUDA 13.0)
  • litert-torch: 0.9.1
  • Export time: ~12 minutes
  • Toolchain: public litert-torch release (not Google's internal tools)

Known Limitations

  1. PLE bloat: Google's stock E2B is 2.41 GB with proprietary PLE compression. Community exports via public litert-torch produce larger PLE weights (~1.1 GB vs ~0.6 GB). This is expected and the model runs correctly.
  2. No vision/audio: Stripped for size. If you need multimodal, use the original google/gemma-4-E2B-it or a full-weight community export.
  3. INT4 quality: Slightly lower precision than INT8. Most users won't notice in conversational use, but long-reasoning chains may show minor degradation.
  4. Chat template: Uses the source model's Jinja template. If the Conversation API fails, fall back to raw Session API inference.

Credits

Packaged by lethflow.com, an Ottawa-based AI solutions team that makes operational complexity disappear.

Named for the river Lethe in Greek mythology: the waters that erased what no longer served you. Lethflow does the same for manual workflows, replacing them with seamless AI-powered integrations, real-time analytics, automated compliance, smart scheduling, and intelligent customer communication.

We believe AI should work with your existing tools, not replace them. Every solution connects to the software your team already trusts, making it smarter, faster, and more efficient.

This model was exported, optimized, and packaged by the Lethflow team as part of our on-device AI R&D. If you're building agentic systems, custom Android AI apps, or need help bridging HuggingFace models to mobile runtimes, reach out.


Base Model & Attribution

  • Original checkpoint: google/gemma-4-E2B-it (Google DeepMind)
  • Community fine-tune: prithivMLmods/gemma-4-E2B-it-Uncensored-MAX
  • LiteRT-LM packaging: PeppX / lethflow.com
  • License: Apache 2.0 (inherited from Gemma)

This is an unofficial community build. It is not published by Google or the LiteRT team. Outputs may differ from the source checkpoint due to INT4 re-quantization during packaging.


Tensor G4 / Mali GPU Optimization (2025-05-13)

A metadata patch was applied to ensure the model is recognized as gemma4 by the LiteRT-LM runtime, enabling Gemma4DataProcessor and Multi-token Prediction (MTP) on Google Tensor G4 and compatible Mali GPUs.

Technical Details

| Attribute | Value |
| --- | --- |
| Patch offset | 16420 (within LlmMetadataProto section) |
| Original bytes | 0a 00 (generic_model, field 1) |
| Patched bytes | 42 00 (gemma4, field 8) |
| Runtime effect | Activates Gemma-4-specific data processor and MTP decode path |
| Expected speedup | ~2x faster token generation on mobile GPU/NPU vs generic fallback |

This is a runtime compatibility correction. Model weights, architecture, and inference behavior are unchanged. The patch only updates the llm_model_type enum in the .litertlm bundle header so that LiteRT-LM v0.11.0+ selects the optimized decode pipeline instead of the generic fallback.
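The byte values listed for the patch are consistent with protobuf's tag encoding, where a one-byte tag equals (field_number << 3) | wire_type. The sketch below reproduces both byte pairs under the assumption that the two fields are length-delimited (wire type 2) with a zero length, and then applies a guarded two-byte patch to a synthetic buffer. It does not touch a real .litertlm file, and the helper names are hypothetical.

```python
LENGTH_DELIMITED = 2  # protobuf wire type for embedded messages, strings, bytes


def tag_byte(field_number: int, wire_type: int = LENGTH_DELIMITED) -> int:
    """Encode a protobuf field tag that fits in a single byte."""
    tag = (field_number << 3) | wire_type
    if tag >= 0x80:
        raise ValueError("tag needs more than one byte (varint encoding)")
    return tag


def patch_bytes(data: bytes, offset: int, expected: bytes, replacement: bytes) -> bytes:
    """Replace `expected` with `replacement` at `offset`, refusing on mismatch."""
    if len(expected) != len(replacement):
        raise ValueError("patch must not change the buffer length")
    if data[offset:offset + len(expected)] != expected:
        raise ValueError("bytes at offset do not match the expected original")
    return data[:offset] + replacement + data[offset + len(expected):]


ORIGINAL = bytes([tag_byte(1), 0x00])  # field 1, zero-length payload
PATCHED = bytes([tag_byte(8), 0x00])   # field 8, zero-length payload
print(ORIGINAL.hex(" "), "->", PATCHED.hex(" "))

# Synthetic stand-in buffer; a real bundle would use offset 16420.
buffer = b"\x00" * 4 + ORIGINAL + b"\xff"
patched = patch_bytes(buffer, 4, ORIGINAL, PATCHED)
```

Checking the existing bytes before writing keeps the patch from silently corrupting a bundle whose layout differs from the one described here.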

Commit Hash Update

The patched bundle corresponds to commit e7e8f3d72ad3f667f313e4c7b118491d696d339f. If you are referencing this model in a model_allowlist.json or similar registry, update the commitHash field to this value for integrity verification.
