
Multimodal Pretraining Tokenizer

This repository is configured as a pretraining tokenizer.

Tokenizer-level behavior:

  • BOS is disabled: bos_token = None, bos_token_id = None.
  • EOS/EOD is </s> with token id 2.
  • generation_config.json uses eos_token_id = 2.
  • Plain tokenization does not automatically add BOS or EOS, even with add_special_tokens=True.
  • The pretraining chat template appends </s> after assistant content.
  • chat_template.jinja is present; Transformers gives this file priority over the chat_template field stored in the tokenizer's JSON config.

<|im_end|> remains in the vocabulary as token id 11, but it is not the EOS/EOD token for this pretraining tokenizer.
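The settings listed above can be checked after loading with a small sanity check. This is a minimal sketch: `check_pretraining_tokenizer` is a hypothetical helper, not part of this repository, and the stub object only stands in for the real tokenizer so the snippet runs offline. It relies on the standard `bos_token`, `bos_token_id`, `eos_token`, `eos_token_id` attributes and the `convert_tokens_to_ids` method of Transformers tokenizers.

```python
from types import SimpleNamespace


def check_pretraining_tokenizer(tok):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tok.bos_token is not None or tok.bos_token_id is not None:
        problems.append("BOS should be disabled")
    if tok.eos_token != "</s>" or tok.eos_token_id != 2:
        problems.append("EOS/EOD should be </s> with id 2")
    if tok.convert_tokens_to_ids("<|im_end|>") != 11:
        problems.append("<|im_end|> should map to id 11")
    return problems


# Stand-in object (assumption) so the sketch runs without downloading anything.
# Replace `stub` with the tokenizer returned by AutoTokenizer.from_pretrained.
_vocab = {"</s>": 2, "<|im_end|>": 11}
stub = SimpleNamespace(
    bos_token=None,
    bos_token_id=None,
    eos_token="</s>",
    eos_token_id=2,
    convert_tokens_to_ids=_vocab.get,
)

print(check_pretraining_tokenizer(stub))  # []
```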

Raw Tokenization

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sanjeevnv/multimodal-pretraining", use_fast=True)

tok.encode("Hello World", add_special_tokens=False)
# [22177, 5325]

tok.encode("Hello World", add_special_tokens=True)
# [22177, 5325]

tok.convert_ids_to_tokens([22177, 5325])
# ["Hello", "ĠWorld"]

To include EOD/EOS in raw text, include </s> explicitly:

tok.encode("Hello World</s>", add_special_tokens=True)
# [22177, 5325, 2]

tok.convert_ids_to_tokens([22177, 5325, 2])
# ["Hello", "ĠWorld", "</s>"]
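Because the tokenizer never adds EOD on its own, a data pipeline has to append it to each document before encoding. The following is a minimal sketch of that step; `with_eod` is a hypothetical helper name and the sample documents are illustrative only.

```python
EOD = "</s>"  # EOS/EOD string described in this README (token id 2)


def with_eod(text, eod=EOD):
    """Append the EOD marker unless the text already ends with it."""
    return text if text.endswith(eod) else text + eod


docs = ["Hello World", "Second document</s>"]
prepared = [with_eod(d) for d in docs]
print(prepared)  # ['Hello World</s>', 'Second document</s>']
```

Each prepared string can then be passed to tok.encode as shown above.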

Messages Template

For messages-style formatting, use apply_chat_template. The current template is a pretraining question/answer format, not a ChatML post-training format.

messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hello World"},
    {"role": "assistant", "content": "Hi."},
]

rendered = tok.apply_chat_template(messages, tokenize=False)
print(rendered)

Rendered text:

You are concise.
question: Hello World
answer: Hi.</s>
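To make the layout concrete, here is a plain-Python sketch that reproduces the rendered text for the example above. It is not the repository's Jinja template: `render_pretraining` is a hypothetical function, and its handling of multi-turn conversations or other roles is an assumption, since only this single example is documented.

```python
def render_pretraining(messages, eos="</s>"):
    """Approximate the question/answer layout shown in this README."""
    lines = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            lines.append(content)  # system text appears without a prefix
        elif role == "user":
            lines.append("question: " + content)
        elif role == "assistant":
            lines.append("answer: " + content + eos)  # EOS follows the answer
        else:
            raise ValueError(f"unsupported role: {role!r}")
    return "\n".join(lines)


example = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hello World"},
    {"role": "assistant", "content": "Hi."},
]
print(render_pretraining(example))
```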

Tokenized output:

ids = tok.apply_chat_template(messages, tokenize=True)
ids
# [4568, 1584, 104335, 1626, 23653, 1058, 45383, 5325, 1010, 24613, 1058, 24665, 1046, 2]

tok.convert_ids_to_tokens(ids)
# ["You", "Ġare", "Ġconcise", ".Ċ", "question", ":", "ĠHello", "ĠWorld", "Ċ", "answer", ":", "ĠHi", ".", "</s>"]

With return_assistant_tokens_mask=True, the assistant content and </s> are marked as assistant tokens:

encoded = tok.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)

encoded["assistant_masks"]
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
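One common use of this mask is to restrict the training loss to assistant tokens. The sketch below builds a label list from the ids and mask shown above; it assumes a training loop that skips positions labeled -100, which is the default ignore_index of PyTorch's cross-entropy loss. `mask_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_labels(input_ids, assistant_mask, ignore_index=IGNORE_INDEX):
    """Keep token ids where the mask is 1; replace all others with ignore_index."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have equal length")
    return [
        tok_id if flag == 1 else ignore_index
        for tok_id, flag in zip(input_ids, assistant_mask)
    ]


# Values copied from the example output in this README.
ids = [4568, 1584, 104335, 1626, 23653, 1058, 45383, 5325, 1010, 24613, 1058, 24665, 1046, 2]
mask = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

labels = mask_labels(ids, mask)
print(labels[-3:])  # [24665, 1046, 2]
```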