|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- anima |
|
|
- modernbert |
|
|
base_model_relation: finetune |
|
|
base_model: |
|
|
- circlestone-labs/Anima |
|
|
--- |
|
|
|
|
|
# Cosmos BERT |
|
|
|
|
|
BERT for Anima/Cosmos. |
|
|
|
|
|
This is *not* an adapter model, but rather an early replacement for the T5/Qwen models.
|
|
|
|
|
This means the T5, Qwen, and LLM adapter files will be phased out.
|
|
|
|
|
It was trained on both T5 (text) and the [AnimaTextToImagePipeline](https://huggingface.co/nightknocker/tdrussell-secret-model-diffusers) (text-image pairs). |
|
|
|
|
|
 |
|
|
|
|
|
## LoRA support |
|
|
|
|
|
Character adapters created with kohya-ss/sd-scripts are compatible with the BERT text encoder, which appears to recognize the [trigger words](https://huggingface.co/datasets/newtextdoc1111/danbooru-tag-csv) without issue.
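A minimal sketch of how one might sanity-check trigger words against an encoder's vocabulary. Everything here is invented for illustration (the vocabulary set and the trigger words); a real check would pull the vocabulary from the tokenizer itself.

```python
# Hypothetical sketch: flag trigger-word pieces missing from a vocabulary.
# Both the vocabulary and the trigger words are made up for illustration.
vocab = {'1girl', 'twintails', 'aqua', 'hair'}

def missing_pieces(trigger, vocab):
    """Return the space-separated pieces of a trigger word not in vocab."""
    return [piece for piece in trigger.lower().split() if piece not in vocab]

print(missing_pieces('aqua hair', vocab))     # all pieces known -> []
print(missing_pieces('hatsune miku', vocab))  # unknown pieces reported
```

An empty result suggests the trigger word tokenizes into pieces the encoder has seen; non-empty results point at words worth testing in an actual prompt.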
|
|
|
|
|
## Mixing @tags |
|
|
|
|
|
 |
|
|
|
|
|
## What has changed |
|
|
|
|
|
### CLIP and LongCLIP
|
|
|
|
|
- Read the model configuration. Note that the token length is no longer limited to 77 or [248](https://huggingface.co/nightknocker/sdxs-1b-image-to-longclip-encoder). |
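As a rough illustration of what the longer context window means in practice. The 77 and 248 limits are CLIP's and LongCLIP's; the 8192 figure is ModernBERT's default position count, assumed here, and the prompt length is invented.

```python
# Illustrative sketch: which encoders would truncate a long tag prompt.
# Limits: CLIP (77), LongCLIP (248), ModernBERT default positions (8192,
# an assumption for this model). The prompt length is made up.
LIMITS = {'CLIP': 77, 'LongCLIP': 248, 'ModernBERT': 8192}

def needs_truncation(num_tokens, limit):
    """True if a prompt of num_tokens would be cut off by the encoder."""
    return num_tokens > limit

prompt_tokens = 300  # a long danbooru-style tag list
for name, limit in LIMITS.items():
    print(name, needs_truncation(prompt_tokens, limit))
```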
|
|
|
|
|
### SD models |
|
|
|
|
|
- Compared to the old CLIPTextModel, it supports longer text input and has a modernized architecture. |
|
|
|
|
|
- See the References section. None of the retrained text encoders shows poorer text understanding than the CLIP models; furthermore, they demonstrate improved understanding of [gestures, spatial relations, and colors](https://huggingface.co/nightknocker/rosaceae-t5gemma-adapter).
|
|
|
|
|
## Z-Image and Qwen |
|
|
|
|
|
- LLMs carry redundant knowledge (arXiv:2511.07384, arXiv:2403.03853), so falling back to smaller language models does not cause irrecoverable knowledge loss, as has been [demonstrated](https://huggingface.co/nightknocker/recurrent-qwen3-z-image-turbo). This is particularly true for specialized anime models.
|
|
|
|
|
## Subject-Focused Attention |
|
|
|
|
|
- In an SVO sentence, CLIP focuses too heavily on the subject; its text encoder is undertrained for certain verbs and cannot reliably identify the object's position.
|
|
|
|
|
## Inference |
|
|
|
|
|
```python |
|
|
# Use the default ModernBertConfig.
# AutoTokenizer is from transformers; CosmosBert is the model class
# provided by this repository (import path assumed).
from transformers import AutoTokenizer

bert = CosmosBert.from_pretrained('nightknocker/cosmos-bert').to('cuda')
tokenizer = AutoTokenizer.from_pretrained('nightknocker/cosmos-bert')

text = '1girl, solo, cherry blossoms'  # example prompt (placeholder)
inputs = tokenizer(text, return_tensors='pt').to('cuda')
crossattn_emb = bert(**inputs, return_dict=True).last_hidden_state
|
|
``` |
|
|
|
|
|
## References |
|
|
|
|
|
- [Recurrent Qwen](https://huggingface.co/nightknocker/recurrent-qwen3-z-image-turbo) |
|
|
- [Recurrent Gemma](https://huggingface.co/nightknocker/recurrent-t5gemma-l-l-ul2-encoder) |
|
|
- [Rosaceae](https://huggingface.co/nightknocker/rosaceae-t5gemma-adapter) |
|
|
|
|
|
## Datasets |
|
|
|
|
|
- anime-art-multicaptions (multicharacter interactions) |
|
|
- danbooru2025-metadata |
|
|
- danbooru wikis full |
|
|
- [eyes](https://huggingface.co/datasets/nightknocker/anima-eyes-never-lie) |
|
|
- [rouwei 0.8](https://huggingface.co/datasets/nightknocker/rouwei-eyes-never-lie) |