---
library_name: transformers
datasets:
- SmallDoge/SmallCorpus
---

# TFM-tokenizer

TFM-tokenizer is trained on [SmallCorpus](https://huggingface.co/datasets/SmallDoge/SmallCorpus) and supports table understanding, document retrieval, tool invocation, and reasoning.

This tokenizer was trained on 2M samples drawn from the following mix:

- Web-EN 50%
- Web-ZH 20%
- TextBook-EN 15%
- TextBook-ZH 5%
- Code 5%
- Math 5%
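
As a rough illustration of the mix above, the sketch below converts the percentages into per-source sample counts. It assumes that "2M" means exactly 2,000,000 samples and that the percentages apply to sample counts rather than tokens; neither is stated here, and the names `TOTAL_SAMPLES`, `MIX_PERCENT`, and `samples_per_source` are hypothetical.

```python
# Assumption: "2M samples" is exactly 2,000,000 and the split is by sample count.
TOTAL_SAMPLES = 2_000_000

# Percentages copied from the list above.
MIX_PERCENT = {
    "Web-EN": 50,
    "Web-ZH": 20,
    "TextBook-EN": 15,
    "TextBook-ZH": 5,
    "Code": 5,
    "Math": 5,
}


def samples_per_source(total, mix_percent):
    """Return the number of samples each source contributes to the mix."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIX_PERCENT)
for name, n in counts.items():
    print(f"{name:<12} {n:>9,}")
```

Under these assumptions, Web-EN alone accounts for half of the training samples, while Code and Math together contribute one tenth.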

## How to use

<details>
<summary>With text</summary>

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

text = "I am TFM, a table foundation model."

# Tokenize a batch of one string and return PyTorch tensors.
inputs = tokenizer([text], return_tensors="pt")
print(inputs)
```

```shell
{
    'input_ids': tensor([[128000, 40, 1097, 350, 26691, 11, 264, 2007, 16665, 1646, 13]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```

</details>

<details>
<summary>With table</summary>

```python
from tfm.modeling.tokenization_tfm_fast import TFMTokenizerFast

tokenizer = TFMTokenizerFast.from_pretrained("DIAL-TFM/TFM-tokenizer")

# A table is a list of rows; here the first row holds the column headers.
table = [
    ["Name", "Age", "City"],
    ["Jingze", "21", "Guangzhou"],
]

# Tokenize a batch of one table.
inputs = tokenizer.batch_process_tables([table])
print(inputs)
```

```shell
{
    'input_ids': tensor([[ 678, 17166, 13020, 41, 287, 3059, 1691, 17198, 526, 52865]]),
    'row_ids': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]),
    'col_ids': tensor([[0, 1, 2, 0, 0, 0, 1, 2, 2, 2]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```

</details>
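
In the table output above, `row_ids` and `col_ids` appear to give the cell coordinates of each token. The sketch below groups the token IDs by cell to make that alignment visible. It reads the output as plain Python lists, so it does not need the tokenizer or PyTorch. The per-token interpretation is inferred from the example output, not from documented behavior, and `group_tokens_by_cell` is a hypothetical helper.

```python
from collections import defaultdict

# Values copied from the example output above (first and only batch item).
input_ids = [678, 17166, 13020, 41, 287, 3059, 1691, 17198, 526, 52865]
row_ids = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
col_ids = [0, 1, 2, 0, 0, 0, 1, 2, 2, 2]


def group_tokens_by_cell(input_ids, row_ids, col_ids):
    """Map each (row, col) pair to the token IDs tagged with it, in order."""
    cells = defaultdict(list)
    for token_id, row, col in zip(input_ids, row_ids, col_ids):
        cells[(row, col)].append(token_id)
    return dict(cells)


cells = group_tokens_by_cell(input_ids, row_ids, col_ids)
for (row, col), ids in sorted(cells.items()):
    print(f"row={row} col={col} token_ids={ids}")
```

Each header cell maps to a single token here, while the cells at row 1, columns 0 and 2 (`"Jingze"` and `"Guangzhou"` in the input table) each span three tokens.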

<details>
<summary>With conversation</summary>

```python
```

```shell
```

</details>

<details>
<summary>With documents</summary>

```python
```

```shell
```

</details>

<details>
<summary>With tools</summary>

```python
```

```shell
```

</details>

<details>
<summary>With reasoning</summary>

```python
```

```shell
```

</details>

<details>
<summary>With documents tools reasoning</summary>

```python
```

```shell
```

</details>