ferrotorch/sd-v1-5-clip-text-encoder

Stable Diffusion 1.5 CLIP text encoder (runwayml/stable-diffusion-v1-5, text_encoder/ subfolder; the text tower of openai/clip-vit-large-patch14). 12 transformer layers, hidden_size=768, intermediate_size=3072, num_attention_heads=12, max_position_embeddings=77, vocab_size=49408, hidden_act=quick_gelu, layer_norm_eps=1e-5. Causal self-attention. ~123M-param text conditioner. RAIL-M licensed. Real-artifact baseline for SD CLIP text encoder parity vs transformers (#1152).

Provenance

  • Upstream: runwayml/stable-diffusion-v1-5 (subfolder text_encoder/), openrail.
  • Conversion script: ferrotorch/scripts/pin_pretrained_diffusion_weights.py.
  • Ferrotorch issue: https://github.com/dollspace-gay/ferrotorch/issues/1152.
  • SHA-256 of model.safetensors (this file is pinned in ferrotorch-hub/src/registry.rs): 52de4b2426c9e31a63dadec5d111f766af7304b1ab205872b060c274727861de.
  • Number of trainable parameters in the text encoder: 123,060,480.
  • Config snapshot: hidden_size=768, intermediate_size=3072, num_attention_heads=12, num_hidden_layers=12, max_position_embeddings=77, vocab_size=49408, hidden_act='quick_gelu', layer_norm_eps=1e-05.
  • Dropped upstream int64 buffer keys (not parameters on either side): ['text_model.embeddings.position_ids'].
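The pinned parameter count follows directly from the config snapshot. A minimal sketch of the arithmetic, assuming the standard CLIP text-tower layout (token + position embeddings, q/k/v/out projections with biases, a two-layer MLP, two LayerNorms per block, and one final LayerNorm); the shape bookkeeping here is illustrative, not something ferrotorch exports:

```rust
fn main() {
    // Config snapshot values from the provenance list above.
    let (h, inter, layers, max_pos, vocab) = (768u64, 3072u64, 12u64, 77u64, 49408u64);

    let embeddings = vocab * h + max_pos * h;        // token + position tables
    let attn = 4 * (h * h + h);                      // q/k/v/out, weight + bias
    let mlp = (h * inter + inter) + (inter * h + h); // fc1 + fc2, weight + bias
    let norms = 2 * 2 * h;                           // two LayerNorms (gamma, beta)
    let per_layer = attn + mlp + norms;
    let final_norm = 2 * h;                          // post-encoder LayerNorm

    let total = embeddings + layers * per_layer + final_norm;
    println!("{total}"); // 123060480, matching the pinned artifact
}
```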

Value-parity probe

Two extra files are uploaded so the ferrotorch-side harness can reproduce the parity verdict without re-running the upstream CLIPTextModel:

  • _value_parity_input_ids.bin: pre-tokenized input ids for the fixed prompt "a photograph of an astronaut riding a horse", padded to [1, 77] with the CLIP pad/eos token. Stored as f32 (every CLIP-BPE id fits in 24 bits, so the cast is lossless). Shipped so the Rust side does not need a tokenizer on the parity hot path.
  • _value_parity_last_hidden_state.bin: float32 last_hidden_state [1, 77, 768] from CLIPTextModel(input_ids=input_ids, return_dict=True).last_hidden_state on float32 weights in eval mode. Same dump format as every other ferrotorch artifact: [u32 ndim][u32 × ndim shape][f32 × prod(shape)], little-endian.
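The dump layout above can be parsed with plain std. A minimal sketch, round-tripping a tiny tensor through the same byte layout; the function name read_dump is illustrative, not a ferrotorch API:

```rust
use std::convert::TryInto;

/// Parse a ferrotorch tensor dump: [u32 ndim][u32 × ndim shape][f32 × prod(shape)],
/// all little-endian. Returns (shape, flat data) or None on malformed input.
fn read_dump(bytes: &[u8]) -> Option<(Vec<u32>, Vec<f32>)> {
    let ndim = u32::from_le_bytes(bytes.get(..4)?.try_into().ok()?) as usize;
    let mut shape = Vec::with_capacity(ndim);
    let mut off = 4;
    for _ in 0..ndim {
        shape.push(u32::from_le_bytes(bytes.get(off..off + 4)?.try_into().ok()?));
        off += 4;
    }
    let numel: usize = shape.iter().map(|&d| d as usize).product();
    let mut data = Vec::with_capacity(numel);
    for _ in 0..numel {
        data.push(f32::from_le_bytes(bytes.get(off..off + 4)?.try_into().ok()?));
        off += 4;
    }
    // Reject trailing bytes so truncated/oversized dumps are caught.
    (off == bytes.len()).then_some((shape, data))
}

fn main() {
    // Serialize a [2, 3] tensor in the dump layout, then parse it back.
    let mut buf: Vec<u8> = Vec::new();
    buf.extend(2u32.to_le_bytes());
    for d in [2u32, 3] { buf.extend(d.to_le_bytes()); }
    for v in [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0] { buf.extend(v.to_le_bytes()); }

    let (shape, data) = read_dump(&buf).expect("well-formed dump");
    assert_eq!(shape, vec![2, 3]);
    assert_eq!(data, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);
    println!("shape={shape:?}");
}
```

The same reader applies unchanged to both probe files: [1, 77] for the input ids (cast back to u32 after reading) and [1, 77, 768] for the reference hidden states.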

How to load

use ferrotorch_diffusion::{ClipTextConfig, load_clip_text_encoder};
use ferrotorch_hub::{HubCache, hf_download_model};

let cache = HubCache::with_default_dir();
let repo_dir = hf_download_model("ferrotorch/sd-v1-5-clip-text-encoder", "main", &cache)?;
let cfg = ClipTextConfig::from_file(&repo_dir.join("config.json"))?;
let (encoder, _drop_report) = load_clip_text_encoder::<f32>(
    &repo_dir.join("model.safetensors"),
    cfg,
    /* strict = */ false,
)?;
let ids: Vec<u32> = /* CLIP-BPE tokenized prompt, length max_position_embeddings */;
let last_hidden_state = encoder.forward_from_ids(&ids)?;  // [1, 77, 768]
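With the encoder output and the reference dump both in memory, the parity verdict reduces to an elementwise comparison of two flat f32 buffers. A minimal sketch of such a check; the helper name and the 1e-4 tolerance are illustrative assumptions, not the actual harness settings:

```rust
/// Max absolute elementwise difference between a computed tensor and a
/// reference dump, both flattened to f32 slices of equal length.
fn max_abs_diff(got: &[f32], want: &[f32]) -> f32 {
    assert_eq!(got.len(), want.len(), "shape mismatch");
    got.iter().zip(want).map(|(a, b)| (a - b).abs()).fold(0.0, f32::max)
}

fn main() {
    // Stand-in data; in the harness these would be the two [1, 77, 768] buffers.
    let reference = [0.5f32, -1.25, 3.0];
    let computed = [0.5f32, -1.25, 3.0 + 1e-6];
    let diff = max_abs_diff(&computed, &reference);
    assert!(diff <= 1e-4, "parity failure: max |diff| = {diff}");
    println!("max |diff| = {diff:e}");
}
```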

Upstream license

Stable Diffusion v1.5 is distributed under the CreativeML Open RAIL-M license. The text-encoder slice mirrored here inherits that license; see https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/LICENSE for the full terms.
