---
library_name: transformers
datasets:
- SmallDoge/SmallCorpus
---
# TFM-tokenizer
TFM-tokenizer is trained on [SmallCorpus](https://huggingface.co/datasets/SmallDoge/SmallCorpus) and supports table understanding, document retrieval, tool invocation, and reasoning.
The training set consists of 2M samples drawn from the following sources:
- Web-EN 50%
- Web-ZH 20%
- TextBook-EN 15%
- TextBook-ZH 5%
- Code 5%
- Math 5%
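
The mix above can be turned into approximate per-source sample counts. The snippet below is a minimal sketch that assumes "2M" means exactly 2,000,000 samples and that the percentages are shares of samples (not of tokens or bytes); the names `TOTAL_SAMPLES`, `MIX`, and `samples_per_source` are illustrative and not part of the tokenizer's API.

```python
# Assumption: "2M samples" is taken as exactly 2,000,000.
TOTAL_SAMPLES = 2_000_000

# Percentages copied from the list above.
MIX = {
    "Web-EN": 50,
    "Web-ZH": 20,
    "TextBook-EN": 15,
    "TextBook-ZH": 5,
    "Code": 5,
    "Math": 5,
}

# The shares should cover the whole training set.
assert sum(MIX.values()) == 100

# Integer arithmetic avoids floating-point rounding.
samples_per_source = {
    name: TOTAL_SAMPLES * pct // 100 for name, pct in MIX.items()
}

for name, count in samples_per_source.items():
    print(f"{name}: {count:,}")
```

Under these assumptions, English text (Web-EN plus TextBook-EN) accounts for roughly 1.3M samples and Chinese text for roughly 0.5M.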
## How to use
<details>
<summary>With text</summary>
```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast
tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")
text = "I am TFM, a table foundation model."
# The call returns a dict-like encoding with `input_ids` and `attention_mask`.
encoded = tokenizer([text], return_tensors="pt")
print(encoded)
```
```shell
{
    'input_ids': tensor([[128000, 40, 1097, 350, 26691, 11, 264, 2007, 16665, 1646, 13]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
</details>
<details>
<summary>With table</summary>
```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast
tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")
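# A table is a list of rows; the first row holds the column headers.
table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]
# The result also carries per-token `row_ids` and `col_ids`.
encoded = tokenizer.batch_process_tables([table])
print(encoded)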
```
```shell
{
    'input_ids': tensor([[ 678, 17166, 13020, 41, 287, 3059, 1691, 17198, 526, 52865]]),
    'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
    'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
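
To read the output above, it can help to group token ids by their `(row_id, col_id)` pair. The following is a minimal sketch that uses plain Python lists copied from the printed tensors, so it runs without the `tfm` package; the variable `cells` is illustrative, and the interpretation that each pair identifies one table cell is inferred from the example rather than from the library's documentation.

```python
from collections import defaultdict

# Values copied from the example output above (batch dimension removed).
input_ids = [678, 17166, 13020, 41, 287, 3059, 1691, 17198, 526, 52865]
row_ids = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
col_ids = [0, 1, 2, 0, 0, 0, 1, 2, 2, 2]

# Collect the token ids that share the same (row, column) position.
cells = defaultdict(list)
for token_id, row, col in zip(input_ids, row_ids, col_ids):
    cells[(row, col)].append(token_id)

for (row, col), token_ids in sorted(cells.items()):
    print(f"row {row}, col {col}: {token_ids}")
```

Here the header cells each map to a single token, while the cells at row 1, columns 0 and 2 (`"Jingze"` and `"Guangzhou"` in the input table) each span three tokens.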
</details>
<details>
<summary>With conversation</summary>
```python
```
```shell
```
</details>
<details>
<summary>With documents</summary>
```python
```
```shell
```
</details>
<details>
<summary>With tools</summary>
```python
```
```shell
```
</details>
<details>
<summary>With reasoning</summary>
```python
```
```shell
```
</details>
<details>
<summary>With documents, tools, and reasoning</summary>
```python
```
```shell
```
</details>