---
library_name: transformers
datasets:
- SmallDoge/SmallCorpus
---

# TFM-tokenizer

TFM-tokenizer is trained on [SmallCorpus](https://huggingface.co/datasets/SmallDoge/SmallCorpus) and supports table understanding, document retrieval, tool invocation, and reasoning. It was trained on 2M samples drawn from the following mixture:

- Web-EN: 50%
- Web-ZH: 20%
- TextBook-EN: 15%
- TextBook-ZH: 5%
- Code: 5%
- Math: 5%

## How to use
### With text

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."
input_ids = tokenizer([text], return_tensors="pt")
print(input_ids)
```

```shell
{
    'input_ids': tensor([[128000, 40, 1097, 350, 26691, 11, 264, 2007, 16665, 1646, 13]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
### With table

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]
input_ids = tokenizer.batch_process_tables([table])
print(input_ids)
```

```shell
{
    'input_ids': tensor([[  678, 17166, 13020,    41,   287,  3059,  1691, 17198,   526, 52865]]),
    'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
    'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
### With conversation
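`TFMTokenizerFast` inherits `apply_chat_template` from `transformers`, so a conversation can be rendered and tokenized in one call. This is a minimal sketch assuming the tokenizer ships a chat template; the resulting token ids depend on that template.

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

# A conversation is a list of {"role", "content"} messages.
messages = [
    {"role": "user", "content": "What is a table foundation model?"},
    {"role": "assistant", "content": "A model pretrained to understand tabular data."},
    {"role": "user", "content": "Can it read CSV files?"},
]

# Render the conversation with the tokenizer's chat template and tokenize it.
# `add_generation_prompt=True` appends the prefix for the next assistant turn.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
print(input_ids)
```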
### With documents
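`apply_chat_template` also accepts a `documents` argument, the standard `transformers` hook for RAG-style prompting. A minimal sketch, assuming the chat template defines a document layout (the `title`/`text` schema below follows the `transformers` convention and may differ for this template):

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

messages = [{"role": "user", "content": "Which city does Jingze live in?"}]

# Retrieved passages; the chat template decides how they appear in the prompt.
documents = [
    {"title": "Profile", "text": "Jingze is 21 years old and lives in Guangzhou."},
]

input_ids = tokenizer.apply_chat_template(
    messages, documents=documents, add_generation_prompt=True, return_tensors="pt"
)
print(input_ids)
```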
### With tools
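For tool invocation, `apply_chat_template` takes a `tools` list; `transformers` converts type-hinted, docstring-documented Python functions into the JSON schema the template expects. The `get_city` function below is a hypothetical example, not part of this repository:

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

def get_city(name: str) -> str:
    """Look up the city a person lives in. (Hypothetical tool.)

    Args:
        name: The person's name.
    """
    ...

messages = [{"role": "user", "content": "Where does Jingze live?"}]

# The function's signature and docstring are turned into a tool schema
# and rendered into the prompt by the chat template.
input_ids = tokenizer.apply_chat_template(
    messages, tools=[get_city], add_generation_prompt=True, return_tensors="pt"
)
print(input_ids)
```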
### With reasoning
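Reasoning prompts are controlled by the chat template: extra keyword arguments to `apply_chat_template` are forwarded to it, so a template-defined switch can enable a thinking segment. The flag name `enable_thinking` below is an assumption borrowed from common reasoning templates; check this tokenizer's chat template for the actual switch.

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

messages = [{"role": "user", "content": "How many people in the table are over 20?"}]

# `enable_thinking` is forwarded to the chat template (flag name assumed).
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,
    return_tensors="pt",
)
print(input_ids)
```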
### With documents, tools, and reasoning
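The three hooks compose in a single call. As above, this is a sketch: `get_age` is hypothetical, the document schema follows the `transformers` convention, and `enable_thinking` is an assumed template flag.

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

def get_age(name: str) -> int:
    """Return a person's age. (Hypothetical tool.)

    Args:
        name: The person's name.
    """
    ...

messages = [{"role": "user", "content": "Is Jingze older than 18?"}]
documents = [{"title": "Profile", "text": "Jingze is 21 and lives in Guangzhou."}]

# Documents, tools, and the (assumed) reasoning flag in one prompt.
input_ids = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tools=[get_age],
    add_generation_prompt=True,
    enable_thinking=True,
    return_tensors="pt",
)
print(input_ids)
```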