---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- SAIL
---

# SAIL

[\[📂 GitHub\]](https://github.com/bytedance/SAIL)
[\[📜 paper\]](https://arxiv.org/abs/2504.10462)
[\[🚀 Quick Start\]](#quick-start)

## Introduction

SAIL is a **S**ingle tr**A**nsformer model for v**I**sion and **L**anguage: a unified multimodal large language model (MLLM) that seamlessly integrates raw pixel encoding and language decoding within a single architecture. **Without relying on a pre-trained vision encoder**, SAIL achieves competitive performance across a wide range of vision-language tasks and learns strong visual representations, rivaling state-of-the-art vision models on tasks such as semantic segmentation.

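To make the encoder-free design concrete, here is a minimal sketch of this style of raw-pixel input. It is our illustration, not code from the repository: the patch size `P`, image resolution, and `hidden_size` are assumed values, and SAIL's actual embedding may differ in detail.

```python
import torch

# Illustrative only: encoder-free vision input for a single-transformer MLLM.
# Each P x P RGB patch is flattened to a 3*P*P vector and projected by one
# linear layer, so image patches and text tokens share the same transformer.
P = 16                                     # assumed patch size
image = torch.rand(3, 224, 224)            # dummy RGB image in [0, 1]

patches = (
    image.unfold(1, P, P).unfold(2, P, P)  # (3, nh, nw, P, P)
         .permute(1, 2, 0, 3, 4)           # (nh, nw, 3, P, P)
         .reshape(-1, 3 * P * P)           # (nh*nw, 3*P*P) flat patch vectors
)

hidden_size = 4096                         # assumed LLM hidden size
patch_proj = torch.nn.Linear(3 * P * P, hidden_size)
vision_embeds = patch_proj(patches)        # interleave with text embeddings
print(vision_embeds.shape)                 # torch.Size([196, 4096])
```

The Quick Start below feeds the model exactly such flattened raw patches (`image_patches.view(nh * nw, -1)`), with the patch size taken from `model.config.vision_patch_size`.
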
## Model

| Model Name | HF Link |
|:----------:|:--------------------------------------------------------:|
| SAIL-7B | [🤗 link](https://huggingface.co/ByteDance-Seed/SAIL-7B) |

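If you want the checkpoint on disk before running the example, a standard `huggingface_hub` download works. This is a generic usage sketch, assuming the tokenizer ships in the same repository:

```python
from huggingface_hub import snapshot_download

# Download the SAIL-7B weights from the Hub; the returned path can serve
# as PATH_TO_MODEL / PATH_TO_TOKENIZER in the Quick Start below.
local_dir = snapshot_download(repo_id="ByteDance-Seed/SAIL-7B")
print(local_dir)
```
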
## Quick Start

We provide example code to run `SAIL`. The snippet relies on helper functions (`get_transformer_and_tokenizer`, `convert_image_base64_to_patches`, `prepare_image_textual_seq_norowsep`, `create_single_prefix_mask`, `generate_mm_pos_ids_singleit`, among others) imported from `example.py` in the [GitHub repository](https://github.com/bytedance/SAIL).

```python
import copy

import torch
from transformers import DynamicCache, GenerationConfig

from example import *  # helper functions from the SAIL GitHub repository

NON_VISION_TOKEN_ID = -1
PATH_TO_MODEL = "path to model"
PATH_TO_TOKENIZER = "path to tokenizer"
IMAGE_PATH = "path to image"
PROMPT = "content of prompt"

# Load the model and tokenizer, and move the model to GPU.
model, tokenizer = get_transformer_and_tokenizer(
    PATH_TO_MODEL,
    PATH_TO_TOKENIZER
)
model = model.cuda()

# Convert the image into an (nh, nw) grid of raw pixel patches.
image_processor = lambda x: convert_image_base64_to_patches(
    load_image_to_base64(x), model.config.vision_patch_size, fix_res_size=None
)
image_patches = image_processor(IMAGE_PATH)
nh, nw = image_patches.shape[:2]

# Build the textual stand-ins for the vision patches and the instruction prompt.
image_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False)
prompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT)
input_tokens = image_tokens + prompt_inp
input_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors="pt").input_ids

# Map each vision-patch placeholder token to the row index of its pixel patch;
# all other positions keep NON_VISION_TOKEN_ID.
vision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID)
vision_patches = image_patches.view(nh * nw, -1)
assert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0)
assert (input_ids >= tokenizer.vis_beg_tok_id).sum() == image_tokens_len
vision_patch_indices[input_ids == tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))

# Bidirectional (prefix) attention over the image tokens, plus multimodal position ids.
attention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0)
position_ids = generate_mm_pos_ids_singleit(
    input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw
).unsqueeze(1)

input_ids = input_ids.long().cuda()
vision_patch_indices = vision_patch_indices.long().cuda()
vision_patches = vision_patches.to(torch.bfloat16).cuda()
position_ids = position_ids.long().cuda()
attention_mask = attention_mask.cuda()

padding_attention_mask = torch.ones_like(input_ids).cuda()

# Full-sequence inputs used for generation.
inputs = dict(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=padding_attention_mask,
    vision_patches=vision_patches,
    vision_patch_indices=vision_patch_indices,
    use_cache=True
)

# Image-prefix-only inputs used to pre-fill the KV cache.
cached_inputs = dict(
    input_ids=input_ids[:, :image_tokens_len],
    position_ids=position_ids[:, :, :image_tokens_len],
    attention_mask=attention_mask[:, :, :image_tokens_len, :image_tokens_len],
    vision_patches=vision_patches,
    vision_patch_indices=vision_patch_indices[:, :image_tokens_len],
    use_cache=True
)

# Pre-fill the image prefix once, then generate on top of the cached states.
prefix_cache = DynamicCache()
with torch.no_grad():
    prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values

past_key_values = copy.deepcopy(prefix_cache)
generate_config = GenerationConfig(
    max_new_tokens=1024,
    return_dict_in_generate=True,
    output_attentions=False
)
generated = model.generate(
    **inputs,
    past_key_values=past_key_values,
    generation_config=generate_config
)
generated_ids = generated['sequences'][:, input_ids.size(1):]
response = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"\nModel Response: ===\n{response}\n===")
```
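
Note the two-pass structure of the example: the image tokens form a prefix that attends bidirectionally (via `create_single_prefix_mask`), and their key/value states are pre-filled once into a `DynamicCache`. A deep copy of that cache is handed to `generate`, which decodes the text autoregressively on top of it, leaving `prefix_cache` intact for reuse with further prompts about the same image.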

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{lei2025sail,
  title={The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer},
  author={Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong},
  journal={arXiv preprint arXiv:2504.10462},
  year={2025}
}
```