Safety Warning & Terms of Access

This model has had its safety filtering removed and can generate general NSFW content. By accessing this model, you agree to: (1) use it responsibly and legally, (2) not use it to create illegal content, and (3) comply with all applicable laws in your country.

ERNIE-Image Abliterated

Tips are greatly appreciated and help sustain the compute resources needed for further research!

Read this in other languages: 日本語 (Japanese)

This repository provides an "Abliterated" (uncensored) version of the text encoders used in the baidu/ERNIE-Image model. By surgically removing the safety filter refusal vectors from the model weights, this version allows unconstrained image generation instructions to pass directly to the DiT (Diffusion Transformer) engine.

Overview

Unlike standard Diffusion models, ERNIE-Image employs a unique architecture with two distinct text processing stages:

Prompt Enhancer (PE): Based on the Ministral3 architecture, responsible for interpreting and expanding brief user inputs.
Text Encoder (TE): Based on the Mistral3 architecture, responsible for encoding the expanded text into embeddings for the DiT.

Because safety alignments and refusal logic can trigger in either of these stages, both models have been abliterated in this repository.

Mechanism: Abliteration (Orthogonalization)

This project uses the mathematically precise Orthogonalization of Concept Vectors methodology (similar to the approach used in flux2-klein-4b-uncensored). No fine-tuning was performed.

Refusal Vector Extraction: We fed 8 "harmful/extreme" prompts and 8 "harmless/general" prompts to both the PE and the TE and recorded the hidden states at every layer. For each layer, the difference between the mean harmful and the mean harmless hidden state is the candidate refusal vector, and we calculated its L2 norm across all layers.
Surgical Orthogonalization: The L2 norm of this difference vector at each layer showed the same trend in both the PE and the TE:
- Layers 0-15: Norm stays in a low band between 0.2 and 3.8 (forming the semantic baseline).
- Layers 16-20: Norm gradually increases from 4.2 to 5.9.
- Layers 21-25: Norm climbs steeply (6.3 → 7.1 → 9.2 → 11.3 → 13.3), then jumps to a refusal spike of ~79.8 at layer 26.
Based on this profile, we targeted only layers 21 through 25, the five layers where the norm ramps up ahead of the layer-26 spike, instead of performing a broad "carpet bombing" of the model. In these layers we subtracted the projection onto the refusal vector from the weight matrices of two critical components:
- Attention Output Projection (o_proj)
- MLP Down Projection (down_proj)
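
The extraction step described above can be sketched with NumPy on synthetic data. This is a minimal illustration, not the script used for this repository: the function name `refusal_profile`, the array layout (prompts, layers, hidden size), and the random stand-in activations are all assumptions made for the example.

```python
import numpy as np

def refusal_profile(harmful, harmless):
    """Return per-layer difference-of-means vectors and their L2 norms.

    harmful, harmless: arrays of shape (num_prompts, num_layers, hidden_dim).
    Which token's hidden state is stored per prompt is left open here.
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)  # (num_layers, hidden_dim)
    norms = np.linalg.norm(diff, axis=-1)                # (num_layers,)
    return diff, norms

# Synthetic stand-in for real activations: 8 prompts per class, 6 layers, 16 dims.
rng = np.random.default_rng(0)
num_prompts, num_layers, hidden_dim = 8, 6, 16
harmless = rng.normal(size=(num_prompts, num_layers, hidden_dim))
harmful = rng.normal(size=(num_prompts, num_layers, hidden_dim))
harmful[:, -1, 0] += 50.0  # inject a large class offset into the last layer only

diff, norms = refusal_profile(harmful, harmless)
peak_layer = int(np.argmax(norms))                # layer with the largest norm
direction = diff[peak_layer] / norms[peak_layer]  # unit-length candidate direction
```

Plotting `norms` against the layer index gives the kind of per-layer profile listed above; in this toy data the spike sits in the last layer by construction.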

Mathematical Proof of `down_proj` / `o_proj` Orthogonalization

In the Mistral/Ministral architecture, both o_proj and down_proj project their intermediate activations back into the model's main hidden dimension (the residual stream). Their weight matrices have shape (out_features, in_features), so each row corresponds to one coordinate of the output space, and we can mathematically sever the layer's ability to write in the refusal direction. With direction normalized to unit length, we apply new_weight = weight - outer(direction, direction^T @ weight), which equals (I - direction direction^T) @ weight. For any input x, direction^T @ (new_weight @ x) = 0, so the output of these projections has zero component along the targeted refusal direction, up to floating-point error. This guarantee covers only what the modified projections write; components already present in the residual stream from other layers are not affected.
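
The identity can be checked numerically on a random matrix. The sketch below is illustrative only: the helper name `orthogonalize`, the matrix sizes, and the random data are assumptions, and it uses NumPy rather than the actual model weights.

```python
import numpy as np

def orthogonalize(weight, direction):
    """Remove `direction` from the output space of `weight`.

    weight: (out_features, in_features); direction: (out_features,).
    """
    d = direction / np.linalg.norm(direction)  # the formula requires a unit vector
    return weight - np.outer(d, d @ weight)    # same as (I - d d^T) @ weight

rng = np.random.default_rng(0)
out_features, in_features = 32, 64
weight = rng.normal(size=(out_features, in_features))
direction = rng.normal(size=out_features)
d_unit = direction / np.linalg.norm(direction)

new_weight = orthogonalize(weight, direction)
x = rng.normal(size=in_features)

before = float(d_unit @ (weight @ x))     # generally non-zero
after = float(d_unit @ (new_weight @ x))  # zero up to floating-point error
```

Applying `orthogonalize` a second time leaves `new_weight` unchanged (again up to rounding), because the projector I - d d^T is idempotent.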

Verification Methodology

To demonstrate the abliteration, we did not rely on simple "Cosine Similarity" to evaluate the filter's removal. Cosine Similarity is useful for confirming general model integrity (where >0.99 means the model is not broken), but it is a weak metric for showing that a specific, narrow vector has been removed from a high-dimensional space.

Instead, we verified the removal by projecting (dot product) the difference in the model's outputs directly onto the extracted Refusal Vector.

Verification Results (Projection on Refusal Direction):

Metric: Projection of the (Harmful Mean - Harmless Mean) vector onto the isolated Refusal Direction.

Prompt Enhancer (PE):
- Original Model Refusal Activation: 79.81
- Abliterated Model Refusal Activation: 64.28
- Harmless Prompts Avg Cosine Similarity: 0.9979
Text Encoder (TE):
- Original Model Refusal Activation: 79.74
- Abliterated Model Refusal Activation: 64.52
- Harmless Prompts Avg Cosine Similarity: 0.9974
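
As a rough sketch of how the two metrics above can be computed, the example below uses synthetic vectors in place of real encoder outputs. The function names, the array shapes, and the toy numbers (a class offset of 10 with 4 units removed) are assumptions chosen for illustration and are unrelated to the figures reported above.

```python
import numpy as np

def refusal_activation(harmful, harmless, direction):
    """Projection of (harmful mean - harmless mean) onto the unit direction."""
    d = direction / np.linalg.norm(direction)
    return float((harmful.mean(axis=0) - harmless.mean(axis=0)) @ d)

def mean_cosine_similarity(a, b):
    """Average row-wise cosine similarity of two (num_prompts, hidden_dim) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
hidden_dim = 16
direction = np.zeros(hidden_dim)
direction[0] = 1.0

harmless_orig = rng.normal(size=(8, hidden_dim))
harmful_orig = rng.normal(size=(8, hidden_dim)) + 10.0 * direction

# Hypothetical "edited model": harmful outputs lose 4 units along the direction,
# while harmless outputs are only slightly perturbed.
harmful_edit = harmful_orig - 4.0 * direction
harmless_edit = harmless_orig + 0.01 * rng.normal(size=(8, hidden_dim))

act_orig = refusal_activation(harmful_orig, harmless_orig, direction)
act_edit = refusal_activation(harmful_edit, harmless_edit, direction)
cos_harmless = mean_cosine_similarity(harmless_orig, harmless_edit)
```

In this toy setup the projection drops by about 4 while the harmless cosine similarity stays close to 1, which shows why the projection is the more sensitive of the two metrics for a single direction.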

Analysis: Why ~65 and not 0?

You might wonder why the abliterated projection drops to ~65 instead of 0. We interpret the value of ~65 as the baseline semantic distance: the natural mathematical gap between extreme/violent concepts (e.g., murder) and peaceful concepts (e.g., a sunset) established deep within the early embedding layers.

The original model's score of ~79.8 consists of this natural semantic baseline (~65) plus an artificial spike of ~15. We attribute this +15 spike to the safety filter pivoting the vector to output a refusal.

Our abliterated models neutralize this +15 refusal spike (15.53 for the PE and 15.22 for the TE), returning the projection to its natural semantic baseline. At the same time, the Cosine Similarity on harmless prompts (0.9979 for the PE, 0.9974 for the TE) indicates that the fundamental linguistic and descriptive capabilities of the encoders remain intact (not lobotomized).
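
The size of the removed component follows directly from the reported figures. The short calculation below only restates the numbers from the results list; the dictionary layout and variable names are illustrative.

```python
# Reported projections onto the refusal direction (from the results above).
reported = {
    "PE": {"original": 79.81, "abliterated": 64.28},
    "TE": {"original": 79.74, "abliterated": 64.52},
}

removed = {}
share = {}
for name, r in reported.items():
    removed[name] = r["original"] - r["abliterated"]  # size of the removed spike
    share[name] = removed[name] / r["original"]       # fraction of the original
    print(f"{name}: removed {removed[name]:.2f}, remaining {r['abliterated']:.2f}")
# PE: removed 15.53, remaining 64.28
# TE: removed 15.22, remaining 64.52
```

In both encoders the removed component is a little under one fifth of the original projection.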

Repository Structure

abliterated_pe/: The modified Prompt Enhancer model files (Safetensors).
abliterated_text_encoder/: The modified Text Encoder model files (Safetensors).

Usage (Python / Diffusers)

To use these unconstrained encoders in your local Diffusers pipeline, replace the default components with the local directories (or Hugging Face repository paths) of this project:

import torch
from diffusers import ErnieImagePipeline
from transformers import AutoModelForCausalLM, AutoModel

model_id = "baidu/ERNIE-Image"
# Replace with the repository ID where this model is hosted, e.g., "username/ERNIE-Image-Abliterated"
abliterated_repo_id = "your_username/ERNIE-Image-Abliterated"

# Load the abliterated Prompt Enhancer from its subfolder.
# A Hub repo ID cannot contain a nested path, so pass the folder via `subfolder`.
pe = AutoModelForCausalLM.from_pretrained(abliterated_repo_id, subfolder="abliterated_pe", torch_dtype=torch.bfloat16)

# Load the abliterated Text Encoder from its subfolder
te = AutoModel.from_pretrained(abliterated_repo_id, subfolder="abliterated_text_encoder", torch_dtype=torch.bfloat16)

# Load the full pipeline, swapping in our modified components
pipe = ErnieImagePipeline.from_pretrained(
    model_id,
    pe=pe,
    text_encoder=te,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Now the pipeline will process extreme/NSFW prompts without rejection from the text encoders.

Important Note

This modification prevents the Prompt Enhancer and Text Encoder from rejecting instructions. However, the core rendering engine (DiT) can only draw what it learned from its training data. If the DiT lacks the visual "knowledge" of certain NSFW or extreme concepts due to dataset scrubbing, the output may degrade into noise or anatomical collapse. In such cases, a targeted LoRA is required.

Using these abliterated encoders, we have also tested empirically what visual knowledge the DiT rendering engine actually has and whether it contains internal guardrails of its own. Please refer to the ERNIE-Image Guardrail Verification Repository for the detailed results and the evidence of the DiT's knowledge gap.

Disclaimer & Terms of Use

This repository provides mathematically modified model weights strictly for academic research, security verification, and understanding the internal mechanics of Diffusion Transformers.

No Affiliation: This project is completely independent and is NOT affiliated with, endorsed by, or associated with Baidu, Inc.
User Responsibility: The creator of this repository (the author) assumes ABSOLUTELY NO RESPONSIBILITY for any consequences, damages, or outputs generated by the use of these modified models. You, the user, bear full and sole responsibility for ensuring that your use of this model complies with all applicable local, national, and international laws.
Prohibited Uses: You are strictly prohibited from using this model to generate illegal content, non-consensual deepfakes, child sexual abuse material (CSAM), harassment, or any content that violates the human rights of others.
Original License: These weights are a derivative work of the baidu/ERNIE-Image model. Your use of these weights must also comply with the original Baidu ERNIE-Image License and any associated Acceptable Use Policies (AUP).

日本語 (Japanese)

概要 (Overview)

本リポジトリは、baidu/ERNIE-Imageモデルで使用されているテキストエンコーダーの「Abliterated」バージョンを提供します。モデルの重みからセーフティフィルターの拒絶ベクトルを外科的に除去することで、無制限の画像生成インストラクションを直接DiT (Diffusion Transformer) エンジンに渡すことが可能になります。

標準的なDiffusionモデルとは異なり、ERNIE-Imageは2つの異なるテキスト処理ステージを持つ独自のアーキテクチャを採用しています：

Prompt Enhancer (PE): Ministral3アーキテクチャベース。短いユーザー入力を解釈・拡張する役割。
Text Encoder (TE): Mistral3アーキテクチャベース。拡張されたテキストをDiT用のエンベディングにエンコードする役割。

セーフティアライメントや拒絶ロジックはこれらのどちらのステージでも発動する可能性があるため、本リポジトリでは両方のモデルに対してAbliterationを施しています。

メカニズム: Abliteration (概念ベクトルの直交化)

本プロジェクトでは、数学的に厳密な概念ベクトルの直交化 (Orthogonalization of Concept Vectors) の手法を使用しています (アプローチとしてはflux2-klein-4b-uncensoredで使用されたものと同様です)。ファインチューニングは一切行っていません。

拒絶ベクトルの抽出: 「有害/過激」なプロンプトと「無害/一般的」なプロンプトの8ペアをPEとTEの両方に入力しました。層ごとの隠れ状態を分析することで、全層にわたる拒絶ベクトルのL2ノルムを計算しました。
外科的直交化: 各層の差分ベクトルのL2ノルムを計算した結果、PEとTEの両方で以下のような共通の傾向が観察されました：
- 層 0〜15: ノルムは0.2から3.8の間でほぼ平坦 (意味論的なベースラインを形成)。
- 層 16〜20: ノルムは4.2から5.9へと徐々に増加。
- 層 21〜25: ノルムが指数関数的に急増 (6.3 → 7.1 → 9.2 → 11.3 → 13.3) し、層26で~79.8という巨大な拒絶のスパイク (急上昇) に達する。
この明確なデータに基づき、モデル全体への広範な「絨毯爆撃」を行うのではなく、ノルムが急増している上位5層 (層21から層25) のみをターゲットにしました。2つの重要なコンポーネントの重み行列から、拒絶ベクトルの射影成分を数学的に減算することで、真の外科的介入を実現しました：
- Attention出力射影 (o_proj)
- MLPダウン射影 (down_proj)

`down_proj` / `o_proj` 直交化の数学的証明

Mistral/Ministralアーキテクチャでは、o_projとdown_projの両方が、それぞれの中間変換をモデルのメインの隠れ次元（残差ストリーム）に投影して戻します。これらの重み行列の形状は (out_features, in_features) であり、各行が出力空間の1つの座標に対応するため、拒絶方向へ出力するモデルの能力を数学的に切断することが可能です。direction を単位長に正規化したうえで new_weight = weight - outer(direction, direction^T @ weight) という変換を適用することで、これらの射影の出力は、ターゲットとした拒絶方向の成分が (浮動小数点誤差の範囲内で) ゼロになります。

検証手法

Abliterationの確固たる証明を提供するため、フィルター除去の評価において単純な「コサイン類似度 (Cosine Similarity)」の使用は放棄しました。コサイン類似度は一般的なモデルの完全性を確認するのには有用ですが (>0.99はモデルが壊れていないことを意味する)、高次元空間における特定の狭いベクトルが除去されたことを証明するには弱い指標です。

代わりに、モデルの出力差分を、抽出した拒絶ベクトルに直接 射影 (内積) し、その数値を分析することで除去を検証しました。

検証結果 (拒絶方向への射影):

指標: (有害な平均 - 無害な平均) ベクトルの、分離された拒絶方向への射影

Prompt Enhancer (PE):
- 元モデルの拒絶アクティベーション: 79.81
- Abliteratedモデルの拒絶アクティベーション: 64.28
- 無害プロンプトの平均コサイン類似度: 0.9979
Text Encoder (TE):
- 元モデルの拒絶アクティベーション: 79.74
- Abliteratedモデルの拒絶アクティベーション: 64.52
- 無害プロンプトの平均コサイン類似度: 0.9974

分析: なぜ0ではなく約65なのか？

なぜAbliteratedモデルの射影が0ではなく約65に下がるのか、疑問に思うかもしれません。この約65という値は、ベースラインとなる意味論的距離を表しています。これは、過激/暴力的な概念 (例: 殺人) と平和的な概念 (例: 夕日) との間の、初期の埋め込み層の奥深くで確立された自然な数学的ギャップです。

元のモデルのスコア ~79.8 は、この自然な意味論的ベースライン (65) に 人工的な約15のスパイクが追加されたもの で構成されています。この+15のスパイクこそが、拒絶を出力するためにベクトルを強引にピボットさせているセーフティフィルターです。

私たちのAbliteratedモデルは、この+15の人工的な拒絶スパイクを無力化し、ベクトルを自然な意味論的ベースラインに戻すことに成功しました。同時に、無害なプロンプトでのコサイン類似度 (PE: 0.9979、TE: 0.9974) は、エンコーダーの根本的な言語能力や描写能力が保たれている (ロボトミー化されていない) ことを示しています。

リポジトリ構成

abliterated_pe/: 変更されたPrompt Enhancerモデルファイル (Safetensors)。
abliterated_text_encoder/: 変更されたText Encoderモデルファイル (Safetensors)。

使い方 (Python / Diffusers)

ローカルのDiffusersパイプラインでこれらの無制限エンコーダーを使用するには、デフォルトのコンポーネントを、本プロジェクトのモデルに置き換えます。

import torch
from diffusers import ErnieImagePipeline
from transformers import AutoModelForCausalLM, AutoModel

model_id = "baidu/ERNIE-Image"
# モデルがホストされているリポジトリID（例: "ユーザー名/ERNIE-Image-Abliterated"）
abliterated_repo_id = "your_username/ERNIE-Image-Abliterated"

# Abliterated Prompt Enhancer をサブフォルダからロード
# (HubのリポジトリIDにはネストしたパスを含められないため、subfolder で指定します)
pe = AutoModelForCausalLM.from_pretrained(abliterated_repo_id, subfolder="abliterated_pe", torch_dtype=torch.bfloat16)

# Abliterated Text Encoder をサブフォルダからロード
te = AutoModel.from_pretrained(abliterated_repo_id, subfolder="abliterated_text_encoder", torch_dtype=torch.bfloat16)

# フルパイプラインのロード (コンポーネントを差し替え)
pipe = ErnieImagePipeline.from_pretrained(
    model_id,
    pe=pe,
    text_encoder=te,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# これで、テキストエンコーダーから拒否されることなく、過激/NSFWなプロンプトを処理できるようになります。

重要な注意点

この変更により、Prompt EnhancerとText Encoderが命令を拒絶しなくなることが保証されます。ただし、コアとなるレンダリングエンジン (DiT) は、自身のデータセットで学習した内容しか描画できません。データセットのスクラビング (削除) により、DiTが特定のNSFWや過激な概念の視覚的な「知識」を欠いている場合、出力はノイズや構造の崩壊をもたらす可能性があります。そのような場合は、ターゲットを絞ったLoRAが必要となります。

また、本テキストエンコーダーを利用することで、DiT自体が保有する視覚的知識の限界や、内部ガードレールの有無を検証する実証実験も実施しています。DiTの「知識の欠如 (Knowledge Gap)」を証明した詳細な検証結果や知見については、ERNIE-Image Guardrail Verification Repository をご参照ください。

免責事項および利用規約

本リポジトリは、学術研究、セキュリティ検証、およびDiffusion Transformerの内部メカニズムの理解を厳密な目的として、数学的に改変されたモデルの重みを提供します。

提携関係の否認 (No Affiliation): 本プロジェクトは完全に独立したものであり、Baidu, Inc. とは一切の提携、承認、または関連を持っていません。
利用者の責任 (User Responsibility): 本リポジトリの作成者（著者）は、これらの改変されたモデルの使用によって生じた結果、損害、または生成物に対して一切の責任を負いません。利用者自身が、本モデルの使用に関して適用されるすべての地域、国、および国際的な法律を遵守することに、完全かつ単独で責任を負うものとします。
禁止事項 (Prohibited Uses): 本モデルを使用して、違法なコンテンツ、同意のないディープフェイク、児童性的虐待素材 (CSAM)、ハラスメント、または他者の人権を侵害するいかなるコンテンツを生成することも固く禁じます。
元のライセンス (Original License): このモデルは baidu/ERNIE-Image の派生物です。使用は、元の Baidu ERNIE-Image ライセンスおよび関連する利用規定 (Acceptable Use Policies: AUP) にも準拠する必要があります。

Model tree for ponpoke/ERNIE-Image-Abliterated
Base model: baidu/ERNIE-Image
Finetuned (11): this model