YifanXu/libra-vision-tokenizer


yifanxu committed on May 16, 2024

Commit a97a793 · 1 Parent(s): 9598fd3

model version 1.0

Files changed (3)

README.md +28 -3
vision_tokenizer_config.yaml +23 -0
vqgan.ckpt +3 -0

README.md CHANGED

@@ -1,3 +1,28 @@
----
-license: apache-2.0
----

+## Libra Vision Tokenizer
+This repo provides the pretrained weights of the Libra vision tokenizer, trained with lookup-free quantization.
+### !!! NOTE !!!
+1. Copy the files in this repo into the ``llama-2-7b-chat-hf-libra`` directory (the Hugging Face version of LLaMA2-7B).
+2. Download the pretrained CLIP model from Hugging Face and place it in the same directory. The CLIP model is available [here](https://huggingface.co/openai/clip-vit-large-patch14-336).
+The files should be organized as:
+```
+llama-2-7b-chat-hf-libra/
+│
+│   # original llama files
+│
+├── ...
+│
+│   # newly added vision tokenizer
+│
+├── vision_tokenizer_config.yaml
+├── vqgan.ckpt
+│
+│   # CLIP model
+│
+└── openai-clip-vit-large-patch14-336/
+    └── ...
+```
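
The layout described in the README can be checked before loading the model. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and the only names taken from the README are the two tokenizer files and the CLIP directory.

```python
from pathlib import Path

# Entries the README says must sit next to the original LLaMA files.
REQUIRED_FILES = ("vision_tokenizer_config.yaml", "vqgan.ckpt")
CLIP_DIR = "openai-clip-vit-large-patch14-336"


def missing_entries(model_dir):
    """Return the required entries that are absent from model_dir.

    Hypothetical helper for illustration; it only checks that the paths
    exist, not that their contents are valid.
    """
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not (root / CLIP_DIR).is_dir():
        missing.append(CLIP_DIR + "/")
    return missing


if __name__ == "__main__":
    import sys

    # Default to the directory name used in the README.
    target = sys.argv[1] if len(sys.argv) > 1 else "llama-2-7b-chat-hf-libra"
    problems = missing_entries(target)
    print("layout OK" if not problems else f"missing: {problems}")
```

An empty list means the three entries added by this commit's instructions are present; the original LLaMA files are not checked because the README elides them.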

vision_tokenizer_config.yaml ADDED

@@ -0,0 +1,23 @@

+freeze: True
+max_vision_token_length: 578 # 24*24 (resolution) + 2 (<img> and <\img>); corresponding to model_config.max_vision_token_length, dataset_config.image_size
+params:
+  embed_dim: 1024 # debug
+  ckpt_path: vqgan.ckpt
+  codebook_size: 512
+  num_codebook: 2
+  ddconfig:
+    # only_auto_encoder: True
+    encoder_name: openai-clip-vit-large-patch14-336
+    select_layer: [2,10,18,22]
+    double_z: False
+    z_channels: 1024
+    resolution: 336 # 336
+    in_channels: 3
+    out_ch: 3
+    ch: 128
+    ch_mult: [ 1,1,2,4,8]  # num_down = len(ch_mult)-1
+    num_res_blocks: 2
+    attn_resolutions: [24]
+    dropout: 0.0
+    initial_resolution: 24
+    num_attn_head: 8
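
The comments in this config tie several values together arithmetically. The sketch below restates that arithmetic as a sanity check. The values are copied from the YAML above; the patch size of 14 is an assumption read off the `patch14` part of the encoder name, and the function `vision_token_length` is hypothetical, not code from the Libra repository.

```python
# Values copied from vision_tokenizer_config.yaml.
config = {
    "max_vision_token_length": 578,
    "encoder_name": "openai-clip-vit-large-patch14-336",
    "resolution": 336,
    "initial_resolution": 24,
    "ch_mult": [1, 1, 2, 4, 8],
}

PATCH_SIZE = 14         # assumption: taken from "patch14" in encoder_name
NUM_SPECIAL_TOKENS = 2  # the two image delimiter tokens in the config comment


def vision_token_length(resolution, patch_size=PATCH_SIZE,
                        num_special=NUM_SPECIAL_TOKENS):
    """Patch tokens on a square grid plus the delimiter tokens."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    grid = resolution // patch_size
    return grid * grid + num_special


# 336 / 14 = 24 patches per side, matching initial_resolution.
grid = config["resolution"] // PATCH_SIZE
assert grid == config["initial_resolution"]

# 24 * 24 + 2 = 578, matching max_vision_token_length.
assert vision_token_length(config["resolution"]) == config["max_vision_token_length"]

# The ch_mult comment defines num_down = len(ch_mult) - 1.
num_down = len(config["ch_mult"]) - 1
print(grid, vision_token_length(config["resolution"]), num_down)
```

If `resolution` or the encoder is changed, `max_vision_token_length` and `initial_resolution` would presumably need to change with it, as the config comment suggests.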

vqgan.ckpt ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d01a38fadd81dec3557120ec6e8d36d51758ac1a8a8afe58102f404d03e47a08
+size 3247360961
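
The three lines above are a Git LFS pointer: they record the SHA-256 digest and byte size of the real checkpoint rather than its contents. A downloaded `vqgan.ckpt` can be compared against them. The following is a minimal sketch with hypothetical helper names (`parse_lfs_pointer`, `matches_pointer`), using only the Python standard library.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d01a38fadd81dec3557120ec6e8d36d51758ac1a8a8afe58102f404d03e47a08
size 3247360961
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines into the hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path has the pointer's size and digest."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; avoids hashing a truncated download
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        # Stream in chunks so a multi-gigabyte checkpoint is never held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    ckpt = Path("vqgan.ckpt")  # assumed download location
    if ckpt.exists():
        print("checkpoint OK" if matches_pointer(ckpt, pointer) else "checkpoint mismatch")
    else:
        print(f"expected {pointer['size']} bytes, {pointer['algo']} {pointer['digest']}")
```

A size mismatch usually indicates an incomplete download, or that the pointer file itself was fetched instead of the checkpoint.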