---
license: apache-2.0
tags:
- anima
- modernbert
base_model_relation: finetune
base_model:
- circlestone-labs/Anima
---

# Cosmos BERT

BERT for Anima/Cosmos.

This is *not* an adapter model, but rather an early replacement for the T5/Qwen model.

This means that the T5, Qwen, and LLM adapter files are about to say goodbye.

It was trained on both T5 outputs (text) and the [AnimaTextToImagePipeline](https://huggingface.co/nightknocker/tdrussell-secret-model-diffusers) (text-image pairs).

![](images/preview.png)

## LoRA support

Character adapters created by kohya-ss/sd-scripts are compatible with the BERT text encoder. This new text encoder seemingly recognizes the [trigger words](https://huggingface.co/datasets/newtextdoc1111/danbooru-tag-csv) without issue.

## Mixing @tags

![](images/mix.png)

## What has changed

### CLIP and LongCLIP

- Read the model configuration. Note that the token length is no longer limited to 77 or [248](https://huggingface.co/nightknocker/sdxs-1b-image-to-longclip-encoder).
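
To see what dropping the hard cap buys in practice, here is a minimal sketch, using a toy whitespace tokenizer (not the real one), of how a CLIP-style 77-token limit silently truncates a long tag-style prompt while an uncapped encoder keeps all of it:

```python
def encode(prompt, max_tokens=None):
    """Toy whitespace tokenizer; only illustrates hard context caps."""
    tokens = prompt.split()
    if max_tokens is not None:
        tokens = tokens[:max_tokens]  # CLIP-style hard truncation
    return tokens

# A long danbooru-style prompt: 100 comma-separated tags.
prompt = " ".join(f"tag_{i}," for i in range(100))

clip_like = encode(prompt, max_tokens=77)    # everything past token 77 is lost
bert_like = encode(prompt, max_tokens=None)  # full prompt survives

print(len(clip_like), len(bert_like))  # 77 100
```

With a capped encoder the tail of the prompt never reaches the model, which is why long tag lists behave differently across encoders.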

### SD models

- Compared to the old CLIPTextModel, it supports longer text input and has a modernized architecture.

- See the References section: none of the retrained text encoders showed poorer text understanding than the CLIP models, and they demonstrated improved understanding of [gestures, spatial relations, and colors](https://huggingface.co/nightknocker/rosaceae-t5gemma-adapter).

### Z-Image and Qwen

- LLMs carry redundant knowledge (arXiv:2511.07384, arXiv:2403.03853). Switching to smaller language models therefore does not cause irrecoverable knowledge loss, as has been [demonstrated](https://huggingface.co/nightknocker/recurrent-qwen3-z-image-turbo). This is particularly true for specialized anime models.
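
The cited work prunes layers by measuring how little a layer changes its input: if a layer's output is nearly identical to its input, it is a pruning candidate. A minimal sketch of that idea with hypothetical per-layer hidden states (the vectors below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical (input, output) hidden states for a 4-layer model.
layer_io = [
    ([1.0, 0.0], [0.6, 0.8]),    # layer transforms its input a lot
    ([0.6, 0.8], [0.61, 0.79]),  # near-identity: a pruning candidate
    ([0.61, 0.79], [0.0, 1.0]),
    ([0.0, 1.0], [0.01, 0.99]),  # near-identity again
]

redundant = [i for i, (x, y) in enumerate(layer_io) if cosine(x, y) > 0.99]
print(redundant)  # [1, 3]
```

Layers whose output barely differs from their input contribute little, which is the redundancy that makes smaller replacement encoders viable.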

### Subject-Focused Attention

- In an SVO sentence structure, CLIPs focus too much on the subject; text encoders are undertrained on certain verbs and cannot reliably identify the object's position.
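
A quick way to see why subject/object roles matter: a purely bag-of-words view of a prompt (roughly what an encoder degrades to when it ignores word order) cannot distinguish an SVO sentence from its reversal, even though the intended image differs. A minimal illustration:

```python
def bag_of_words(prompt):
    """Order-insensitive view: what survives if roles are lost."""
    return frozenset(prompt.lower().split())

a = "a girl holding a cat"
b = "a cat holding a girl"

# Word order distinguishes subject from object...
assert a != b
# ...but the bag-of-words views are identical, so an encoder that
# discards roles cannot tell who is holding whom.
assert bag_of_words(a) == bag_of_words(b)
```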

## Inference

```python
from transformers import AutoTokenizer

# CosmosBert uses the default ModernBertConfig.
bert = CosmosBert.from_pretrained('nightknocker/cosmos-bert').to('cuda')
tokenizer = AutoTokenizer.from_pretrained('nightknocker/cosmos-bert')
inputs = tokenizer(text, return_tensors='pt').to('cuda')
crossattn_emb = bert(**inputs, return_dict=True).last_hidden_state
```

## References

- [Recurrent Qwen](https://huggingface.co/nightknocker/recurrent-qwen3-z-image-turbo)
- [Recurrent Gemma](https://huggingface.co/nightknocker/recurrent-t5gemma-l-l-ul2-encoder)
- [Rosaceae](https://huggingface.co/nightknocker/rosaceae-t5gemma-adapter)

## Datasets

- anime-art-multicaptions (multicharacter interactions)
- danbooru2025-metadata
- danbooru wikis full
- [eyes](https://huggingface.co/datasets/nightknocker/anima-eyes-never-lie)
- [rouwei 0.8](https://huggingface.co/datasets/nightknocker/rouwei-eyes-never-lie)