# Gemma-3-12b-it: Heretic Uncensored Text Encoder for LTX-2.3
⚠️ Important: This encoder is for text embedding only, not prompt enhancement.
The `TextGenerateLTX2Prompt` node uses Gemma as a text generator (token-by-token sampling). Abliteration breaks instruction following in this mode: instead of outputting a clean structured prompt, the model will ramble, reason through the task, and wrap the actual prompt in markdown/thinking noise. The resulting garbage gets passed downstream as your conditioning text and ruins your output.

Use this encoder only in `CLIP Text Encode` / `DualCLIPLoader` for embedding. Write your prompts manually using the `[VISUAL]`/`[AUDIO]` structured format (see below), or use a stock Gemma encoder for the prompt enhancement node and this encoder for everything else (though the stock encoder will "soften" explicit results, as Gemma models are designed to do).
An uncensored version of Google's Gemma-3-12b-it, processed with Heretic to remove safety-aligned prompt filtering, then packaged for direct use as a text encoder in LTX-2.3 video generation via ComfyUI.
Make sure to use Kijai/LTX2.3_comfy for the actual model files. The single-file checkpoint (as opposed to the split files) bundles the stock Gemma encoder, which will silently filter your prompt and effectively make this encoder useless.
## Why this exists
LTX-2.3 uses Gemma-3-12b-it as its text encoder. The stock Gemma model includes safety alignment that silently weakens or sanitizes certain prompt embeddings: even when the model doesn't overtly refuse, the internal representations of creative or mature concepts are diluted before they reach the video transformer. This results in reduced prompt adherence, visual softening, and concept dilution for a range of legitimate creative use cases.
This encoder removes those restrictions at the weight level while preserving the model's language understanding capabilities, giving LTX-2.3 the most faithful possible interpretation of your prompts.
## Writing prompts manually
Since the prompt enhancement node can't be used with this encoder, write your prompts in LTX-2.3's structured format:
```
[VISUAL]: Medium shot of a woman in a black dress walking through a neon-lit street at night.
She turns toward the camera and smiles. Slow dolly forward, shallow depth of field, warm
highlights from neon signs reflecting on wet pavement. Her hair moves naturally in a light breeze.
[AUDIO]: City ambience, distant traffic, soft footsteps on wet concrete, faint electronic music
from a nearby bar, a car horn in the distance.
```
Tips:
- Describe action across the full clip duration (3-4 beats across 10 seconds)
- Include camera direction (dolly, pan, tracking, static)
- Describe lighting and mood
- Don't describe character appearance if you're using I2V with a reference image β the image handles identity
- Audio descriptions directly influence the synchronized audio generation
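The structured format above is easy to assemble programmatically if you are batching prompts. A minimal sketch (the `build_ltx_prompt` helper is hypothetical, not part of ComfyUI; it only joins the two sections in the layout shown above):

```python
def build_ltx_prompt(visual: str, audio: str) -> str:
    """Assemble a prompt in the [VISUAL]/[AUDIO] structured format.

    Hypothetical helper (not part of ComfyUI): it simply concatenates
    the two sections in the layout shown above.
    """
    return f"[VISUAL]: {visual.strip()}\n[AUDIO]: {audio.strip()}"


prompt = build_ltx_prompt(
    visual=(
        "Medium shot of a woman in a black dress walking through a "
        "neon-lit street at night. Slow dolly forward, shallow depth of field."
    ),
    audio="City ambience, distant traffic, soft footsteps on wet concrete.",
)
print(prompt)
```

The resulting string can be pasted straight into a `CLIP Text Encode` node.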
## Available variants
| File | Size | Format | Notes |
|---|---|---|---|
| `gemma-3-12b-it-heretic-bf16.safetensors` | ~24 GB | BF16 | Full precision. Maximum quality, no quantization loss. |
| `gemma-3-12b-it-heretic-fp8-comfy.safetensors` | ~14.5 GB | FP8 E4M3FN | ComfyUI-native block-scaled quantization. Recommended for most users. |
| `gemma-3-12b-it-heretic-fp4-comfy.safetensors` | ~9.5 GB | NVFP4 | ComfyUI-native double-quantized FP4. Same format as Kijai's official packages. Best for ≤24 GB VRAM. |
All variants include the SentencePiece tokenizer embedded in the safetensors file. Embeddings and layer norms are kept in original precision across all quantizations.
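The per-block scaling used by the quantized variants can be illustrated in NumPy. This is a sketch of the general idea (one scale per 16-value block so each block fits within the FP8 E4M3FN range of ±448); the actual `MixedPrecisionOps` tensor layout in ComfyUI differs in storage details, and the cast to FP8 itself is omitted here:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3FN value
BLOCK = 16            # block size used by the block-scaled format


def block_scale_quantize(w: np.ndarray):
    """One scale per 16-value block, mapping each block into FP8 range.

    Illustration only: the real ComfyUI MixedPrecisionOps layout and the
    actual cast of the scaled values to FP8 E4M3FN are omitted.
    """
    flat = w.astype(np.float32).reshape(-1, BLOCK)
    scales = np.abs(flat).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero blocks
    return flat / scales, scales


def block_scale_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)


w = np.random.default_rng(0).standard_normal(4 * BLOCK).astype(np.float32)
q, scales = block_scale_quantize(w)
restored = block_scale_dequantize(q, scales)
assert np.abs(q).max() <= FP8_E4M3_MAX + 0.01  # every block fits FP8 range
assert np.allclose(w, restored)                # lossless until the FP8 cast
```

Keeping embeddings and layer norms in original precision (as noted above) sidesteps the layers most sensitive to this kind of scaling.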
## How to use
- Download whichever variant fits your VRAM budget to `ComfyUI/models/text_encoders/`
- In your LTX-2.3 workflow, select this file as the Gemma text encoder (first slot of the `DualCLIPLoader` or `LTXAVTextEncoderLoader` node)
- Select the standard `ltx-2.3_text_projection_bf16.safetensors` as the second encoder (unchanged; the projection layer has no safety alignment)
Works with all LTX-2.3 ComfyUI workflows including text-to-video, image-to-video, and IC-LoRA control pipelines.
## Dual-encoder setup (recommended for prompt enhancement)
If you want to use the TextGenerateLTX2Prompt node for automatic prompt enhancement, use two DualCLIPLoaders:
- DualCLIPLoader #1 (stock Gemma): wired to `TextGenerateLTX2Prompt` only
- DualCLIPLoader #2 (heretic Gemma): wired to `CLIP Text Encode` (positive/negative)
The stock encoder generates a clean structured prompt. The heretic encoder embeds it without safety filtering. No content filtering where it counts.
## Processing details
- Base model: google/gemma-3-12b-it (full BF16 weights, not QAT)
- Abliteration tool: Heretic v1.2.0
- Method: Directional orthogonalization of attention out-projection and MLP down-projection matrices against computed refusal directions, with per-layer optimized ablation weights minimizing KL divergence from the original model
- Key prefix: Weights are stored with the `model.*` prefix (matching ComfyUI's expected format), not the HuggingFace `language_model.model.*` prefix
- Tokenizer: SentencePiece model embedded under the `spiece_model` tensor key (ComfyUI's expected format)
- Quantization: FP8 and FP4 variants produced with per-block scaling matching ComfyUI's `MixedPrecisionOps` format (block size 16, FP8 E4M3FN scales, scalar float32 global scale)
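The directional-orthogonalization step can be sketched in NumPy. Assuming a unit refusal direction `r` in a layer's output space, the core operation is `W' = (I - alpha * r r^T) W`, which removes the component of every output of `W` along `r` when `alpha = 1`. Heretic optimizes `alpha` per layer and computes `r` from contrastive prompt activations; this sketch takes both as given:

```python
import numpy as np


def ablate_direction(W: np.ndarray, r: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Orthogonalize weight matrix W against refusal direction r.

    Core abliteration step, W' = (I - alpha * r r^T) W: with alpha = 1,
    no output of W' has any component along r.  Heretic optimizes alpha
    per layer; here it is a plain parameter.
    """
    r = r / np.linalg.norm(r)              # unit vector in output space
    return W - alpha * np.outer(r, r @ W)


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))            # stand-in for an out-projection
r = rng.standard_normal(8)                 # stand-in refusal direction
W_ablated = ablate_direction(W, r)
r_unit = r / np.linalg.norm(r)
assert np.allclose(r_unit @ W_ablated, 0.0)   # outputs now orthogonal to r
```

Applying this to the attention out-projections and MLP down-projections, with per-layer `alpha` chosen to minimize KL divergence from the original model, is what the method description above refers to.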
## Compatibility
- ComfyUI ≥ 0.16.1 (LTX-2.3 native support)
- LTX-2.3 (22B dev, distilled, or any variant using Gemma text encoder)
- LTX-2 (19B; should also work, same text encoder architecture)
- Works on both NVIDIA (CUDA) and AMD (ROCm) GPUs
## Limitations
- This is a text encoder only; it does not add NSFW visual generation capability to the LTX-2.3 video transformer (which was intentionally trained on SFW data only). The video model must still be fine-tuned separately to generate content outside its training distribution (see below for an easy solution).
- Cannot be used with `TextGenerateLTX2Prompt`: abliteration breaks instruction following for text generation. See the note at the top.
- The abliteration removes refusal-direction alignment from the encoder weights. The resulting embeddings are more faithful to input prompts, but the model may produce slightly unexpected outputs for prompts that the original model would have redirected.
- Quality of quantized variants depends on the inference backend. FP4 on hardware without native FP4 support will use emulated dequantization.
## Fix for Limitations
(I recommend using a LoRA rather than a checkpoint, but either way works.)
- NSFW LoRA option: here's a decent option if NSFW is what you're going for and your I2V reference image doesn't already include NSFW details (the LoRA has a weird name, but it's for general NSFW; I take no responsibility for anything generated with any of these resources): CivitAI General NSFW LoRA
- Finding other LoRAs or checkpoints: there are MANY options, courtesy of the generous/"interesting" CivitAI community. To actually see them, register for CivitAI, log in, and adjust your maturity settings, then filter by LTX2.3 in the Models tab. Fair warning: much of what's on offer is disturbing, and you'll wish you could unsee it.
## Credits
- Lightricks for LTX-2.3
- Google for Gemma-3-12b-it
- Philipp Emanuel Weidmann for Heretic
- Kijai for the ComfyUI-compatible packaging format this release follows
## License
The model weights inherit the Gemma license. Please review Google's terms for usage restrictions.