---
license: apache-2.0
language: zh
tags:
- transformer
- t5
- text2text-generation
- chinese
- multitask
- tokenizer
---

# Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON

This repository hosts a modified version of the [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese) model. The primary purpose of this repository is to **include the `tokenizer.json` file**, which was missing in the original release.

## Motivation for this Repository

The original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model is an excellent T5-based model for various Chinese NLP tasks. However, it was released with only a `spiece.model` file for its tokenizer, lacking the `tokenizer.json` file.

While the Python `transformers` library can generally load the tokenizer from `spiece.model`, this absence caused issues for environments that strictly prefer or require `tokenizer.json` (e.g., certain versions or implementations of the Rust `tokenizers` library, or other frameworks that rely on this standardized format).

To enhance usability and compatibility across different platforms and libraries, this repository was created to provide the model with the commonly expected `tokenizer.json` file.

## Changes Made

The following modifications have been made to the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model files:

* **Added `tokenizer.json`:** The primary change is the inclusion of the `tokenizer.json` file, generated from the original `spiece.model` using the Python `transformers` library's `save_pretrained()` method. This ensures broader compatibility and easier loading for various applications.
* **No Model Weights Changes:** **Crucially, the model weights (`pytorch_model.bin` or `model.safetensors`) themselves have not been altered in any way.** This repository provides the exact same powerful pre-trained model, just with an updated tokenizer serialization format.

## How to Use

You can load this model and its tokenizer using the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON" # Replace with your actual repository name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "你好,这是一个测试。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For Rust users (and others requiring `tokenizer.json`):

```rust
use std::error::Error;
use tokenizers::Tokenizer;

// Note: Tokenizer::from_pretrained requires the `http` feature of the
// `tokenizers` crate and is a blocking (synchronous) call, so no async
// runtime is needed.
fn main() -> Result<(), Box<dyn Error>> {
    let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name

    // from_pretrained downloads tokenizer.json from the Hub and loads it.
    let tokenizer = Tokenizer::from_pretrained(model_id, None)?;

    let text = "你好,这是一个中文文本。";
    let encoding = tokenizer.encode(text, true)?;

    println!("Original text: {}", text);
    println!("Tokens: {:?}", encoding.get_tokens());
    println!("IDs: {:?}", encoding.get_ids());

    let decoded_text = tokenizer.decode(encoding.get_ids(), true)?;
    println!("Decoded text: {}", decoded_text);

    Ok(())
}
```

## Original Model Information

For more details about the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model, its training, capabilities, and benchmarks, please refer to its official repository: [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese).