---
library_name: transformers
datasets:
  - SmallDoge/SmallCorpus
---

# TFM-tokenizer

TFM-tokenizer is trained on the SmallCorpus dataset and supports table understanding, document retrieval, tool invocation, and reasoning.

This tokenizer was trained on 2M samples drawn from the following mixture:

- Web-EN: 50%
- Web-ZH: 20%
- TextBook-EN: 15%
- TextBook-ZH: 5%
- Code: 5%
- Math: 5%
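For a rough sense of scale, the mixture above implies the following approximate sample counts (illustrative arithmetic only; the exact split may differ):

```python
# Approximate per-source sample counts implied by the stated 2M-sample mixture.
total = 2_000_000
mixture = {
    "Web-EN": 0.50,
    "Web-ZH": 0.20,
    "TextBook-EN": 0.15,
    "TextBook-ZH": 0.05,
    "Code": 0.05,
    "Math": 0.05,
}
counts = {name: int(total * frac) for name, frac in mixture.items()}
print(counts["Web-EN"])  # 1000000
```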

## How to use

### With text

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."

inputs = tokenizer([text], return_tensors="pt")
print(inputs)
# {
#     'input_ids': tensor([[128000,     40,   1097,    350,  26691,     11,    264,   2007,  16665,   1646,     13]]),
#     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
# }
```
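The `attention_mask` marks real tokens with 1 and padding with 0, so a single unpadded sequence is all ones. When batching texts of different lengths, shorter sequences are padded and their masks zero out the padded positions. A minimal plain-Python sketch of that behavior (not the tfm API; `pad_batch` is a hypothetical helper):

```python
# Sketch: pad a batch of token-id sequences to a common length and build
# the matching attention masks (1 = real token, 0 = padding).
def pad_batch(sequences, pad_id=0):
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[128000, 40, 1097], [128000, 40]])
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```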
### With table

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]

inputs = tokenizer.batch_process_tables([table])
print(inputs)
# {
#     'input_ids': tensor([[  678, 17166, 13020,    41,   287,  3059,  1691, 17198,   526, 52865]]),
#     'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
#     'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
#     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
# }
```
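The `row_ids` and `col_ids` give every token the row and column index of the cell it came from, with the table flattened row by row. A plain-Python sketch of that alignment (not the tfm implementation; the per-cell token counts below are read off the example output, where "Jingze" and "Guangzhou" each split into 3 tokens):

```python
# Sketch: derive row_ids / col_ids for a table flattened row by row,
# given how many tokens each cell produced. Every token inherits the
# row and column index of its cell.
def table_position_ids(token_counts_per_cell):
    """token_counts_per_cell: list of rows, each a list of per-cell
    token counts. Returns (row_ids, col_ids) as flat lists."""
    row_ids, col_ids = [], []
    for r, row in enumerate(token_counts_per_cell):
        for c, n_tokens in enumerate(row):
            row_ids.extend([r] * n_tokens)
            col_ids.extend([c] * n_tokens)
    return row_ids, col_ids

# "Name", "Age", "City" -> 1 token each; "Jingze" -> 3, "21" -> 1, "Guangzhou" -> 3
row_ids, col_ids = table_position_ids([[1, 1, 1], [3, 1, 3]])
print(row_ids)  # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(col_ids)  # [0, 1, 2, 0, 0, 0, 1, 2, 2, 2]
```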
### With conversation


### With documents


### With tools


### With reasoning


### With documents, tools, and reasoning