# Gemma-3-12b-it: Heretic Uncensored Text Encoder for LTX-2.3
⚠️ Important: This encoder is for text embedding only, not prompt enhancement.
The `TextGenerateLTX2Prompt` node uses Gemma as a text generator (token-by-token sampling). Abliteration breaks instruction following in this mode: instead of outputting a clean structured prompt, the model will ramble, reason through the task, and wrap the actual prompt in markdown/thinking noise. The resulting garbage gets passed downstream as your conditioning text and ruins your output.

Use this encoder only in `CLIP Text Encode` / `DualCLIPLoader` for embedding. Write your prompts manually using the `[VISUAL]`/`[AUDIO]` structured format (see below), or use a stock Gemma encoder for the prompt enhancement node and this encoder for everything else (though the stock encoder will "soften" explicit results, as Gemma models are designed to do).
An uncensored version of Google's Gemma-3-12b-it, processed with Heretic to remove safety-aligned prompt filtering, then packaged for direct use as a text encoder in LTX-2.3 video generation via ComfyUI.
Make sure to use Kijai/LTX2.3_comfy for the actual model files. The single-file checkpoint (as opposed to the split files) bundles the stock Gemma encoder, which will silently filter your prompt and effectively make this encoder useless.
## Why this exists
LTX-2.3 uses Gemma-3-12b-it as its text encoder. The stock Gemma model includes safety alignment that silently weakens or sanitizes certain prompt embeddings: even when the model doesn't overtly refuse, the internal representations of creative or mature concepts are diluted before they reach the video transformer. This results in reduced prompt adherence, visual softening, and concept dilution for a range of legitimate creative use cases.
This encoder removes those restrictions at the weight level while preserving the model's language understanding capabilities, giving LTX-2.3 the most faithful possible interpretation of your prompts.
## Writing prompts manually
Since the prompt enhancement node can't be used with this encoder, write your prompts in LTX-2.3's structured format:
```
[VISUAL]: Medium shot of a woman in a black dress walking through a neon-lit street at night.
She turns toward the camera and smiles. Slow dolly forward, shallow depth of field, warm
highlights from neon signs reflecting on wet pavement. Her hair moves naturally in a light breeze.
[AUDIO]: City ambience, distant traffic, soft footsteps on wet concrete, faint electronic music
from a nearby bar, a car horn in the distance.
```
Tips:
- Describe action across the full clip duration (3-4 beats across 10 seconds)
- Include camera direction (dolly, pan, tracking, static)
- Describe lighting and mood
- Don't describe character appearance if you're using I2V with a reference image β the image handles identity
- Audio descriptions directly influence the synchronized audio generation
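The structured format above is easy to assemble programmatically if you are batching prompts. A minimal sketch (the `build_ltx_prompt` helper is hypothetical, not part of ComfyUI; it only joins the two sections in the layout shown above):

```python
def build_ltx_prompt(visual: str, audio: str) -> str:
    """Assemble a prompt in the [VISUAL]/[AUDIO] structured format.

    Hypothetical helper (not part of ComfyUI): it simply concatenates
    the two sections in the layout shown above.
    """
    return f"[VISUAL]: {visual.strip()}\n[AUDIO]: {audio.strip()}"


prompt = build_ltx_prompt(
    visual=(
        "Medium shot of a woman in a black dress walking through a "
        "neon-lit street at night. Slow dolly forward, shallow depth of field."
    ),
    audio="City ambience, distant traffic, soft footsteps on wet concrete.",
)
print(prompt)
```

The resulting string can be pasted straight into a `CLIP Text Encode` node.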
## Available variants
| File | Size | Format | Notes |
|---|---|---|---|
| `gemma-3-12b-it-heretic-bf16.safetensors` | ~24 GB | BF16 | Full precision. Maximum quality, no quantization loss. |
| `gemma-3-12b-it-heretic-fp8-comfy.safetensors` | ~14.5 GB | FP8 E4M3FN | ComfyUI-native block-scaled quantization. Recommended for most users. |
| `gemma-3-12b-it-heretic-fp4-comfy.safetensors` | ~9.5 GB | NVFP4 | ComfyUI-native double-quantized FP4. Same format as Kijai's official packages. Best for ≤24 GB VRAM. |
All variants include the SentencePiece tokenizer embedded in the safetensors file. Embeddings and layer norms are kept in original precision across all quantizations.
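The per-block scaling used by the quantized variants can be illustrated in NumPy. This is a sketch of the general idea (one scale per 16-value block so each block fits within the FP8 E4M3FN range of ±448); the actual `MixedPrecisionOps` tensor layout in ComfyUI differs in storage details, and the cast to FP8 itself is omitted here:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3FN value
BLOCK = 16            # block size used by the block-scaled format


def block_scale_quantize(w: np.ndarray):
    """One scale per 16-value block, mapping each block into FP8 range.

    Illustration only: the real ComfyUI MixedPrecisionOps layout and the
    actual cast of the scaled values to FP8 E4M3FN are omitted.
    """
    flat = w.astype(np.float32).reshape(-1, BLOCK)
    scales = np.abs(flat).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero blocks
    return flat / scales, scales


def block_scale_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)


w = np.random.default_rng(0).standard_normal(4 * BLOCK).astype(np.float32)
q, scales = block_scale_quantize(w)
restored = block_scale_dequantize(q, scales)
assert np.abs(q).max() <= FP8_E4M3_MAX + 0.01  # every block fits FP8 range
assert np.allclose(w, restored)                # lossless until the FP8 cast
```

Keeping embeddings and layer norms in original precision (as noted above) sidesteps the layers most sensitive to this kind of scaling.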
## How to use
- Download whichever variant fits your VRAM budget to `ComfyUI/models/text_encoders/`
- In your LTX-2.3 workflow, select this file as the Gemma text encoder (first slot of the `DualCLIPLoader` or `LTXAVTextEncoderLoader` node)
- Select the standard `ltx-2.3_text_projection_bf16.safetensors` as the second encoder (unchanged; the projection layer has no safety alignment)
Works with all LTX-2.3 ComfyUI workflows including text-to-video, image-to-video, and IC-LoRA control pipelines.
## Dual-encoder setup (recommended for prompt enhancement)
If you want to use the TextGenerateLTX2Prompt node for automatic prompt enhancement, use two DualCLIPLoaders:
- DualCLIPLoader #1 (stock Gemma): wired to `TextGenerateLTX2Prompt` only
- DualCLIPLoader #2 (heretic Gemma): wired to `CLIP Text Encode` (positive/negative)
The stock encoder generates a clean structured prompt. The heretic encoder embeds it without safety filtering. No content filtering where it counts.
## Processing details
- Base model: google/gemma-3-12b-it (full BF16 weights, not QAT)
- Abliteration tool: Heretic v1.2.0
- Method: Directional orthogonalization of attention out-projection and MLP down-projection matrices against computed refusal directions, with per-layer optimized ablation weights minimizing KL divergence from the original model
- Key prefix: Weights are stored with the `model.*` prefix (matching ComfyUI's expected format), not the HuggingFace `language_model.model.*` prefix
- Tokenizer: SentencePiece model embedded under the `spiece_model` tensor key (ComfyUI's expected format)
- Quantization: FP8 and FP4 variants produced with per-block scaling matching ComfyUI's `MixedPrecisionOps` format (block size 16, FP8 E4M3FN scales, scalar float32 global scale)
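The directional-orthogonalization step can be sketched in NumPy. Assuming a unit refusal direction `r` in a layer's output space, the core operation is `W' = (I - alpha * r r^T) W`, which removes the component of every output of `W` along `r` when `alpha = 1`. Heretic optimizes `alpha` per layer and computes `r` from contrastive prompt activations; this sketch takes both as given:

```python
import numpy as np


def ablate_direction(W: np.ndarray, r: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Orthogonalize weight matrix W against refusal direction r.

    Core abliteration step, W' = (I - alpha * r r^T) W: with alpha = 1,
    no output of W' has any component along r.  Heretic optimizes alpha
    per layer; here it is a plain parameter.
    """
    r = r / np.linalg.norm(r)              # unit vector in output space
    return W - alpha * np.outer(r, r @ W)


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))            # stand-in for an out-projection
r = rng.standard_normal(8)                 # stand-in refusal direction
W_ablated = ablate_direction(W, r)
r_unit = r / np.linalg.norm(r)
assert np.allclose(r_unit @ W_ablated, 0.0)   # outputs now orthogonal to r
```

Applying this to the attention out-projections and MLP down-projections, with per-layer `alpha` chosen to minimize KL divergence from the original model, is what the method description above refers to.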
## Compatibility
- ComfyUI ≥ 0.16.1 (LTX-2.3 native support)
- LTX-2.3 (22B dev, distilled, or any variant using Gemma text encoder)
- LTX-2 (19B; should also work, same text encoder architecture)
- Works on both NVIDIA (CUDA) and AMD (ROCm) GPUs
## Limitations
- This is a text encoder only; it does not add NSFW visual generation capability to the LTX-2.3 video transformer (which was intentionally trained on SFW data only). The video model must still be fine-tuned separately to generate content outside its training distribution (see below for an easy solution).
- Cannot be used with `TextGenerateLTX2Prompt`: abliteration breaks instruction following for text generation. See the note at the top.
- The abliteration removes refusal-direction alignment from the encoder weights. The resulting embeddings are more faithful to input prompts, but the model may produce slightly unexpected outputs for prompts that the original model would have redirected.
- Quality of quantized variants depends on the inference backend. FP4 on hardware without native FP4 support will use emulated dequantization.
## Fix for Limitations
(I recommend using a LoRA rather than a checkpoint, but either way works.)
- NSFW LoRA option: here's a decent option if NSFW is what you're going for and your I2V reference image doesn't already include NSFW details (the LoRA has a weird name, but it's for general NSFW; I take no responsibility for anything generated with any of these resources): CivitAI General NSFW LoRA
- Finding other LoRAs or checkpoints: there are MANY options, courtesy of the generous/"interesting" CivitAI community. To actually see them, register for CivitAI, log in, and adjust your maturity settings, then filter by LTX2.3 in the Models tab. Fair warning: much of what's on offer is disturbing, and you'll wish you could unsee it.
## Credits
- Lightricks for LTX-2.3
- Google for Gemma-3-12b-it
- Philipp Emanuel Weidmann for Heretic
- Kijai for the ComfyUI-compatible packaging format this release follows
## License
The model weights inherit the Gemma license. Please review Google's terms for usage restrictions.