---
library_name: transformers
datasets:
  - SmallDoge/SmallCorpus
---

# TFM-tokenizer

TFM-tokenizer is trained on the SmallCorpus dataset and supports table understanding, document retrieval, tool invocation, and reasoning.

This tokenizer was trained on 2M samples drawn from the following mixture:

- Web-EN: 50%
- Web-ZH: 20%
- TextBook-EN: 15%
- TextBook-ZH: 5%
- Code: 5%
- Math: 5%
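For a rough sense of scale, the mixture above implies the following approximate sample counts (illustrative arithmetic only; the exact split may differ):

```python
# Approximate per-source sample counts implied by the stated 2M-sample mixture.
total = 2_000_000
mixture = {
    "Web-EN": 0.50,
    "Web-ZH": 0.20,
    "TextBook-EN": 0.15,
    "TextBook-ZH": 0.05,
    "Code": 0.05,
    "Math": 0.05,
}
counts = {name: int(total * frac) for name, frac in mixture.items()}
print(counts["Web-EN"])  # 1000000
```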

## How to use

### With text

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."

inputs = tokenizer([text], return_tensors="pt")
print(inputs)
# {
#     'input_ids': tensor([[128000,     40,   1097,    350,  26691,     11,    264,   2007,  16665,   1646,     13]]),
#     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
# }
```
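The `attention_mask` marks real tokens with 1 and padding with 0, so a single unpadded sequence is all ones. When batching texts of different lengths, shorter sequences are padded and their masks zero out the padded positions. A minimal plain-Python sketch of that behavior (not the tfm API; `pad_batch` is a hypothetical helper):

```python
# Sketch: pad a batch of token-id sequences to a common length and build
# the matching attention masks (1 = real token, 0 = padding).
def pad_batch(sequences, pad_id=0):
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[128000, 40, 1097], [128000, 40]])
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```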
### With table

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]

inputs = tokenizer.batch_process_tables([table])
print(inputs)
# {
#     'input_ids': tensor([[  678, 17166, 13020,    41,   287,  3059,  1691, 17198,   526, 52865]]),
#     'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
#     'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
#     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
# }
```
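The `row_ids` and `col_ids` give every token the row and column index of the cell it came from, with the table flattened row by row. A plain-Python sketch of that alignment (not the tfm implementation; the per-cell token counts below are read off the example output, where "Jingze" and "Guangzhou" each split into 3 tokens):

```python
# Sketch: derive row_ids / col_ids for a table flattened row by row,
# given how many tokens each cell produced. Every token inherits the
# row and column index of its cell.
def table_position_ids(token_counts_per_cell):
    """token_counts_per_cell: list of rows, each a list of per-cell
    token counts. Returns (row_ids, col_ids) as flat lists."""
    row_ids, col_ids = [], []
    for r, row in enumerate(token_counts_per_cell):
        for c, n_tokens in enumerate(row):
            row_ids.extend([r] * n_tokens)
            col_ids.extend([c] * n_tokens)
    return row_ids, col_ids

# "Name", "Age", "City" -> 1 token each; "Jingze" -> 3, "21" -> 1, "Guangzhou" -> 3
row_ids, col_ids = table_position_ids([[1, 1, 1], [3, 1, 3]])
print(row_ids)  # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(col_ids)  # [0, 1, 2, 0, 0, 0, 1, 2, 2, 2]
```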
### With conversation


### With documents


### With tools


### With reasoning


### With documents, tools, and reasoning