Cosmos BERT
BERT for Anima/Cosmos.
This is not an adapter model but an early replacement for the T5/Qwen text encoder, which means the T5, Qwen, and LLM adapter files will soon be retired.
It was trained on both T5 (text) and the AnimaTextToImagePipeline (text-image pairs).
LoRA support
Character adapters created by kohya-ss/sd-scripts are compatible with the BERT text encoder. The new text encoder appears to recognize their trigger words without issue.
Mixing @tags
What has changed
CLIP and LongCLIP
- Read the model configuration. Note that the token length is no longer limited to 77 or 248.
SD models
Compared to the old CLIPTextModel, it supports longer text input and has a modernized architecture.
See the References section. None of the retrained text encoders shows poorer text understanding than the CLIP models; on the contrary, they demonstrate improved understanding of gestures, spatial relations, and colors.
Z-Image and Qwen
- LLMs carry redundant knowledge (2511.07384, 2403.03853). As these papers demonstrate, switching to a smaller language model does not cause irrecoverable knowledge loss; this holds especially for specialized anime models.
Subject-Focused Attention
- In an SVO sentence structure, CLIP models focus too heavily on the subject; text encoders are undertrained on certain verbs and cannot reliably identify the object's position.
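A toy sketch of why subject/object roles are hard for weakly order-sensitive encoders: if token embeddings are simply pooled (bag-of-words style), swapping subject and object yields an identical sentence vector, so "a dog chases a cat" and "a cat chases a dog" become indistinguishable. The hash-based embedder below is hypothetical and exists only to make that failure mode concrete; it is not how CLIP or Cosmos BERT works internally.

```python
import hashlib

def toy_embed(token: str, dim: int = 8) -> list[int]:
    # Deterministic pseudo-embedding derived from a hash (illustration only).
    return list(hashlib.sha256(token.encode()).digest()[:dim])

def bag_of_words_encode(sentence: str) -> list[int]:
    # Order-insensitive pooling: sum token embeddings per dimension.
    vecs = [toy_embed(t) for t in sentence.lower().split()]
    return [sum(col) for col in zip(*vecs)]

svo = bag_of_words_encode("a dog chases a cat")
ovs = bag_of_words_encode("a cat chases a dog")
# The subject/object swap is invisible to order-insensitive pooling.
assert svo == ovs
```

An encoder that attends to word order (as transformer text encoders do via position information) can separate the two readings; the point above is that residual order-insensitivity in the learned representation shows up exactly as subject/object confusion.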
Inference
# Use the default ModernBertConfig.
# CosmosBert is the encoder class shipped with the model repository;
# AutoTokenizer comes from transformers.
from transformers import AutoTokenizer

bert = CosmosBert.from_pretrained('nightknocker/cosmos-bert').to('cuda')
tokenizer = AutoTokenizer.from_pretrained('nightknocker/cosmos-bert')

text = 'a girl waving at the viewer'  # example prompt
inputs = tokenizer(text, return_tensors='pt').to('cuda')
crossattn_emb = bert(**inputs, return_dict=True).last_hidden_state
References
Datasets
- anime-art-multicaptions (multicharacter interactions)
- danbooru2025-metadata
- danbooru wikis full
- eyes
- rouwei 0.8
Model tree for nightknocker/cosmos-bert
Base model
circlestone-labs/Anima
