|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- anima |
|
|
- modernbert |
|
|
base_model_relation: finetune |
|
|
base_model: |
|
|
- circlestone-labs/Anima |
|
|
--- |
|
|
|
|
|
# Cosmos BERT |
|
|
|
|
|
BERT for Anima/Cosmos. |
|
|
|
|
|
This is *not* an adapter model, but rather an early replacement for the T5/Qwen models.
|
|
|
|
|
This means the T5, Qwen, and LLM adapter files will be phased out.
|
|
|
|
|
It was trained on both T5 (text) and the [AnimaTextToImagePipeline](https://huggingface.co/nightknocker/tdrussell-secret-model-diffusers) (text-image pairs). |
|
|
|
|
|
 |
|
|
|
|
|
## LoRA support |
|
|
|
|
|
Character adapters created with kohya-ss/sd-scripts are compatible with the BERT text encoder, which appears to recognize the [trigger words](https://huggingface.co/datasets/newtextdoc1111/danbooru-tag-csv) without issue.
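A minimal sketch of how one might sanity-check trigger words against an encoder's vocabulary. Everything here is invented for illustration (the vocabulary set and the trigger words); a real check would pull the vocabulary from the tokenizer itself.

```python
# Hypothetical sketch: flag trigger-word pieces missing from a vocabulary.
# Both the vocabulary and the trigger words are made up for illustration.
vocab = {'1girl', 'twintails', 'aqua', 'hair'}

def missing_pieces(trigger, vocab):
    """Return the space-separated pieces of a trigger word not in vocab."""
    return [piece for piece in trigger.lower().split() if piece not in vocab]

print(missing_pieces('aqua hair', vocab))     # all pieces known -> []
print(missing_pieces('hatsune miku', vocab))  # unknown pieces reported
```

An empty result suggests the trigger word tokenizes into pieces the encoder has seen; non-empty results point at words worth testing in an actual prompt.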
|
|
|
|
|
## Mixing @tags |
|
|
|
|
|
 |
|
|
|
|
|
## What has changed |
|
|
|
|
|
### CLIP and LongCLIP
|
|
|
|
|
- Read the model configuration. Note that the token length is no longer limited to 77 or [248](https://huggingface.co/nightknocker/sdxs-1b-image-to-longclip-encoder). |
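As a rough illustration of what the longer context window means in practice. The 77 and 248 limits are CLIP's and LongCLIP's; the 8192 figure is ModernBERT's default position count, assumed here, and the prompt length is invented.

```python
# Illustrative sketch: which encoders would truncate a long tag prompt.
# Limits: CLIP (77), LongCLIP (248), ModernBERT default positions (8192,
# an assumption for this model). The prompt length is made up.
LIMITS = {'CLIP': 77, 'LongCLIP': 248, 'ModernBERT': 8192}

def needs_truncation(num_tokens, limit):
    """True if a prompt of num_tokens would be cut off by the encoder."""
    return num_tokens > limit

prompt_tokens = 300  # a long danbooru-style tag list
for name, limit in LIMITS.items():
    print(name, needs_truncation(prompt_tokens, limit))
```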
|
|
|
|
|
### SD models |
|
|
|
|
|
- Compared to the old CLIPTextModel, it supports longer text input and has a modernized architecture. |
|
|
|
|
|
- See the References section. None of the retrained text encoders shows poorer text understanding than the CLIP models; furthermore, they demonstrate improved understanding of [gestures, spatial relations, and colors](https://huggingface.co/nightknocker/rosaceae-t5gemma-adapter).
|
|
|
|
|
## Z-Image and Qwen |
|
|
|
|
|
- LLMs carry redundant knowledge (arXiv:2511.07384, arXiv:2403.03853), so falling back to smaller language models does not cause irrecoverable knowledge loss, as has been [demonstrated](https://huggingface.co/nightknocker/recurrent-qwen3-z-image-turbo). This is particularly true for specialized anime models.
|
|
|
|
|
## Subject-Focused Attention |
|
|
|
|
|
- In an SVO sentence, CLIP focuses too heavily on the subject; its text encoder is undertrained for certain verbs and cannot reliably identify the object's position.
|
|
|
|
|
## Inference |
|
|
|
|
|
```python |
|
|
# Use the default ModernBertConfig.
# AutoTokenizer is from transformers; CosmosBert is the model class
# provided by this repository (import path assumed).
from transformers import AutoTokenizer

bert = CosmosBert.from_pretrained('nightknocker/cosmos-bert').to('cuda')
tokenizer = AutoTokenizer.from_pretrained('nightknocker/cosmos-bert')

text = '1girl, solo, cherry blossoms'  # example prompt (placeholder)
inputs = tokenizer(text, return_tensors='pt').to('cuda')
crossattn_emb = bert(**inputs, return_dict=True).last_hidden_state
|
|
``` |
|
|
|
|
|
## References |
|
|
|
|
|
- [Recurrent Qwen](https://huggingface.co/nightknocker/recurrent-qwen3-z-image-turbo) |
|
|
- [Recurrent Gemma](https://huggingface.co/nightknocker/recurrent-t5gemma-l-l-ul2-encoder) |
|
|
- [Rosaceae](https://huggingface.co/nightknocker/rosaceae-t5gemma-adapter) |
|
|
|
|
|
## Datasets |
|
|
|
|
|
- anime-art-multicaptions (multicharacter interactions) |
|
|
- danbooru2025-metadata |
|
|
- danbooru wikis full |
|
|
- [eyes](https://huggingface.co/datasets/nightknocker/anima-eyes-never-lie) |
|
|
- [rouwei 0.8](https://huggingface.co/datasets/nightknocker/rouwei-eyes-never-lie) |