---
license: apache-2.0
language: zh
tags:
- transformer
- t5
- text2text-generation
- chinese
- multitask
- tokenizer
---
# Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON
This repository hosts a modified version of the [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese) model. The primary purpose of this repository is to **include the `tokenizer.json` file**, which was missing in the original release.
## Motivation for this Repository
The original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model is an excellent T5-based model for various Chinese NLP tasks. However, it was released with only a `spiece.model` file for its tokenizer, lacking the `tokenizer.json` file.
While the Python `transformers` library can generally load the tokenizer from `spiece.model`, this absence caused issues for environments that strictly prefer or require `tokenizer.json` (e.g., certain versions or implementations of the Rust `tokenizers` library, or other frameworks that rely on this standardized format).
To enhance usability and compatibility across different platforms and libraries, this repository was created to provide the model with the commonly expected `tokenizer.json` file.
## Changes Made
The following modifications have been made to the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model files:
* **Added `tokenizer.json`:** The primary change is the inclusion of the `tokenizer.json` file, generated from the original `spiece.model` using the Python `transformers` library's `save_pretrained()` method. This ensures broader compatibility and easier loading for various applications.
* **No Changes to Model Weights:** **Crucially, the model weights (`pytorch_model.bin` or `model.safetensors`) have not been altered in any way.** This repository provides the exact same pre-trained model, just with an additional tokenizer serialization format.
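For reference, the conversion described above can be reproduced with a short script. This is a sketch of the process, not the exact commands used: it assumes the original repository ID and a hypothetical output directory name, and relies on `transformers` converting the slow SentencePiece tokenizer into its fast (Rust-backed) counterpart, which is what serializes `tokenizer.json`.

```python
from transformers import AutoTokenizer


def export_tokenizer_json(model_id: str, out_dir: str) -> None:
    """Load the SentencePiece-based tokenizer as a fast tokenizer and
    save it; saving a fast tokenizer writes tokenizer.json to out_dir."""
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    tokenizer.save_pretrained(out_dir)


if __name__ == "__main__":
    # Hypothetical output directory; adjust as needed.
    export_tokenizer_json(
        "IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese",
        "./randeng-t5-with-tokenizer-json",
    )
```

After running this, the output directory contains `tokenizer.json` alongside the original `spiece.model`, and either file can be used to load the tokenizer.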
## How to Use
You can load this model and its tokenizer using the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON" # Replace with your actual repository name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
text = "你好,这是一个测试。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For Rust users (and others requiring `tokenizer.json`):
```rust
use std::error::Error;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn Error>> {
    let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name

    // from_pretrained (available with the crate's "http" feature) downloads
    // and loads the repository's tokenizer.json, which this repo now provides.
    // Note: this call is synchronous; no async runtime is needed.
    let tokenizer = Tokenizer::from_pretrained(model_id, None)?;

    let text = "你好,这是一个中文文本。";
    let encoding = tokenizer.encode(text, true)?;

    println!("Original text: {}", text);
    println!("Tokens: {:?}", encoding.get_tokens());
    println!("IDs: {:?}", encoding.get_ids());

    let decoded_text = tokenizer.decode(encoding.get_ids(), true)?;
    println!("Decoded text: {}", decoded_text);

    Ok(())
}
```
## Original Model Information
For more details about the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model, its training, capabilities, and benchmarks, please refer to its official repository: [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese).