Multimodal Pretraining Tokenizer
This repository is configured as a pretraining tokenizer.
Tokenizer-level behavior:
- BOS is disabled: bos_token = None, bos_token_id = None.
- EOS/EOD is </s> with token id 2.
- generation_config.json uses eos_token_id = 2.
- Plain tokenization does not automatically add BOS or EOS, even with add_special_tokens=True.
- The pretraining chat template appends </s> after assistant content.
- chat_template.jinja is present and is the template file Transformers prioritizes over the JSON chat_template.
<|im_end|> remains in the vocabulary as token id 11, but it is not the EOS/EOD token for this pretraining tokenizer.
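The settings listed above can be checked programmatically after loading the tokenizer. The following is a minimal sketch, not part of this repository: the helper name special_token_mismatches is hypothetical, and it relies only on the standard tokenizer attributes bos_token, bos_token_id, eos_token, eos_token_id and the convert_tokens_to_ids method.

```python
def special_token_mismatches(tok):
    """Compare a loaded tokenizer against the special-token settings in this README.

    Returns a dict {name: (expected, actual)} containing only the settings
    that differ; an empty dict means everything matches.
    """
    expected = {
        "bos_token": None,      # BOS is disabled
        "bos_token_id": None,
        "eos_token": "</s>",    # EOS/EOD token
        "eos_token_id": 2,
    }
    actual = {name: getattr(tok, name) for name in expected}

    # <|im_end|> should still be in the vocabulary as id 11,
    # even though it is not the EOS/EOD token here.
    expected["<|im_end|> id"] = 11
    actual["<|im_end|> id"] = tok.convert_tokens_to_ids("<|im_end|>")

    return {k: (expected[k], actual[k]) for k in expected if expected[k] != actual[k]}
```

Calling special_token_mismatches(tok) on a tokenizer loaded with AutoTokenizer.from_pretrained should return an empty dict if the configuration matches the description above.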
Raw Tokenization
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("sanjeevnv/multimodal-pretraining", use_fast=True)
tok.encode("Hello World", add_special_tokens=False)
# [22177, 5325]
tok.encode("Hello World", add_special_tokens=True)
# [22177, 5325]
tok.convert_ids_to_tokens([22177, 5325])
# ["Hello", "ĠWorld"]
To include EOD/EOS in raw text, include </s> explicitly:
tok.encode("Hello World</s>", add_special_tokens=True)
# [22177, 5325, 2]
tok.convert_ids_to_tokens([22177, 5325, 2])
# ["Hello", "ĠWorld", "</s>"]
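Because the tokenizer never adds EOS/EOD on its own, a pretraining pipeline that works on token ids may prefer to append id 2 itself instead of editing the raw text. A minimal sketch, assuming documents are already tokenized into lists of ids; the helper name pack_documents and the constant EOD_ID are hypothetical:

```python
EOD_ID = 2  # id of </s>, as stated in this README


def pack_documents(token_id_lists, eod_id=EOD_ID):
    """Concatenate tokenized documents, appending the EOD id after each one."""
    packed = []
    for ids in token_id_lists:
        packed.extend(ids)
        packed.append(eod_id)
    return packed


# Two documents: "Hello World" and "Hello" (ids taken from the examples above).
packed = pack_documents([[22177, 5325], [22177]])
print(packed)  # [22177, 5325, 2, 22177, 2]
```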
Messages Template
For messages-style formatting, use apply_chat_template. The current template is a pretraining question/answer format, not a ChatML post-training format.
messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hello World"},
    {"role": "assistant", "content": "Hi."},
]
rendered = tok.apply_chat_template(messages, tokenize=False)
print(rendered)
Rendered text:
You are concise.
question: Hello World
answer: Hi.</s>
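The layout of the rendered text can be approximated in plain Python, which may help when inspecting data without loading the tokenizer. This is a sketch inferred from the single example in this README, not the contents of chat_template.jinja. The function name render_pretraining is hypothetical, and its handling of multi-turn conversations and unknown roles is an assumption.

```python
def render_pretraining(messages, eod="</s>"):
    """Approximate the question/answer layout shown in the rendered example.

    Assumed rules (inferred from one example only):
      system    -> content followed by a newline
      user      -> "question: " + content + newline
      assistant -> "answer: " + content + EOD token
    """
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append(content + "\n")
        elif role == "user":
            parts.append("question: " + content + "\n")
        elif role == "assistant":
            parts.append("answer: " + content + eod)
        else:
            # Behaviour for other roles is unknown, so fail loudly.
            raise ValueError(f"unsupported role: {role!r}")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hello World"},
    {"role": "assistant", "content": "Hi."},
]
print(render_pretraining(example))
```

For real training data, tok.apply_chat_template remains the source of truth; this function is only a readable approximation of the format.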
Tokenized output:
ids = tok.apply_chat_template(messages, tokenize=True)
ids
# [4568, 1584, 104335, 1626, 23653, 1058, 45383, 5325, 1010, 24613, 1058, 24665, 1046, 2]
tok.convert_ids_to_tokens(ids)
# ["You", "Ġare", "Ġconcise", ".Ċ", "question", ":", "ĠHello", "ĠWorld", "Ċ", "answer", ":", "ĠHi", ".", "</s>"]
With return_assistant_tokens_mask=True, the assistant content and </s> are marked as assistant tokens:
encoded = tok.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
encoded["assistant_masks"]
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
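One common use of the assistant mask is to restrict the training loss to assistant tokens. A minimal sketch, assuming the usual convention that label -100 is ignored by the loss (the default ignore_index of PyTorch's cross-entropy loss); the helper name labels_from_mask is hypothetical:

```python
IGNORE_INDEX = -100  # assumed ignore label, matching PyTorch's default ignore_index


def labels_from_mask(input_ids, assistant_mask, ignore_index=IGNORE_INDEX):
    """Keep token ids where the mask is 1 and replace the rest with ignore_index."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tid if m else ignore_index for tid, m in zip(input_ids, assistant_mask)]


# Values copied from the tokenized example in this README.
ids = [4568, 1584, 104335, 1626, 23653, 1058, 45383, 5325, 1010, 24613, 1058, 24665, 1046, 2]
mask = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

labels = labels_from_mask(ids, mask)
print(labels[-3:])  # [24665, 1046, 2]
```

Only "ĠHi", "." and </s> contribute to the loss in this example; the system text, the question, and the "answer:" prefix are masked out.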