---
library_name: transformers
datasets:
- SmallDoge/SmallCorpus
---

# TFM-tokenizer

TFM-tokenizer is a tokenizer trained on [SmallCorpus](https://huggingface.co/datasets/SmallDoge/SmallCorpus). It supports table understanding, document retrieval, tool invocation, and reasoning.

It was trained on 2M samples drawn from the following mix:
- Web-EN 50%
- Web-ZH 20%
- TextBook-EN 15%
- TextBook-ZH 5%
- Code 5%
- Math 5%
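
As a rough illustration of what these proportions mean in absolute terms, the sketch below converts the percentages into per-source sample counts. It assumes that "2M" means exactly 2,000,000 samples and that the shares are measured by sample count; this card does not say whether they are by sample or by token, so the numbers are only indicative.

```python
# Assumption: "2M" is exactly 2,000,000 samples and shares are by sample count.
TOTAL_SAMPLES = 2_000_000

# Percentages as listed in the training mix above.
mix_percent = {
    "Web-EN": 50,
    "Web-ZH": 20,
    "TextBook-EN": 15,
    "TextBook-ZH": 5,
    "Code": 5,
    "Math": 5,
}

# The shares should cover the whole sample budget.
assert sum(mix_percent.values()) == 100

# Integer arithmetic avoids floating-point rounding.
samples_per_source = {
    name: TOTAL_SAMPLES * pct // 100 for name, pct in mix_percent.items()
}

for name, count in samples_per_source.items():
    print(f"{name:<12}{count:>10,}")
```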

## How to use

<details>
<summary>With text</summary>

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

# Load the tokenizer from the Hugging Face Hub.
tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."

# Tokenize a batch of one string and return PyTorch tensors.
inputs = tokenizer([text], return_tensors="pt")
print(inputs)
```
```shell
{
    'input_ids': tensor([[128000,     40,   1097,    350,  26691,     11,    264,   2007,  16665,   1646,     13]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```

</details>

<details>
<summary>With table</summary>

```python
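from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

# Load the tokenizer from the Hugging Face Hub.
tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

# A table is a list of rows; here the first row holds the column headers.
table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]

# Tokenize a batch of one table; every token also gets a row and column index.
inputs = tokenizer.batch_process_tables([table])
print(inputs)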
```
```shell
{
    'input_ids': tensor([[  678, 17166, 13020,    41,   287,  3059,  1691, 17198,   526, 52865]]),
    'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
    'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
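
To make the relationship between the table and the `row_ids` / `col_ids` in the output above concrete, here is a minimal pure-Python sketch that rebuilds both index lists from per-cell token counts. The counts are read off the output above (for example, `Jingze` and `Guangzhou` each span three tokens). The helper `expand_cell_ids` is hypothetical and is not part of the `tfm` package, so treat it as an illustration of the layout rather than the tokenizer's actual implementation.

```python
def expand_cell_ids(token_counts):
    """Repeat each cell's (row, column) index once per token in that cell.

    Hypothetical helper for illustration only; not part of the tfm package.
    """
    row_ids, col_ids = [], []
    for row_index, row in enumerate(token_counts):
        for col_index, n_tokens in enumerate(row):
            row_ids.extend([row_index] * n_tokens)
            col_ids.extend([col_index] * n_tokens)
    return row_ids, col_ids


# Tokens per cell, inferred from the example output above:
# header row: "Name", "Age", "City" -> 1 token each
# data row:   "Jingze" -> 3, "21" -> 1, "Guangzhou" -> 3
token_counts = [
    [1, 1, 1],
    [3, 1, 3],
]

row_ids, col_ids = expand_cell_ids(token_counts)
print(row_ids)
print(col_ids)
```

Both lists have one entry per token, so they line up position by position with `input_ids` and `attention_mask`.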

</details>

<details>
<summary>With conversation</summary>

```python
```
```shell
```
</details>

<details>
<summary>With documents</summary>

```python
```
```shell
```
</details>

<details>
<summary>With tools</summary>

```python
```
```shell
```
</details>

<details>
<summary>With reasoning</summary>

```python
```
```shell
```
</details>

<details>
<summary>With documents, tools, and reasoning</summary>

```python
```
```shell
```
</details>