---
license: apache-2.0
language: zh
tags:
- transformer
- t5
- text2text-generation
- chinese
- multitask
- tokenizer
---
# Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON
This repository hosts a modified version of the [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese) model. The primary purpose of this repository is to **include the `tokenizer.json` file**, which was missing in the original release.
## Motivation for this Repository
The original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model is an excellent T5-based model for various Chinese NLP tasks. However, it was released with only a `spiece.model` file for its tokenizer, lacking the `tokenizer.json` file.
While the Python `transformers` library can generally load the tokenizer from `spiece.model`, this absence caused issues for environments that strictly prefer or require `tokenizer.json` (e.g., certain versions or implementations of the Rust `tokenizers` library, or other frameworks that rely on this standardized format).
To enhance usability and compatibility across different platforms and libraries, this repository was created to provide the model with the commonly expected `tokenizer.json` file.
## Changes Made
The following modifications have been made to the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model files:
* **Added `tokenizer.json`:** The primary change is the inclusion of the `tokenizer.json` file, generated from the original `spiece.model` using the Python `transformers` library's `save_pretrained()` method. This ensures broader compatibility and easier loading for various applications.
* **No Model Weights Changes:** **Crucially, the model weights (`pytorch_model.bin` or `model.safetensors`) themselves have not been altered in any way.** This repository provides the exact same powerful pre-trained model, just with an updated tokenizer serialization format.
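For reference, producing such a `tokenizer.json` is a short operation in `transformers` — a minimal sketch (the output directory name here is arbitrary):

```python
from transformers import AutoTokenizer

# Loading with use_fast=True (the default) converts the original spiece.model
# into a fast (Rust-backed) tokenizer; save_pretrained then serializes it,
# writing tokenizer.json alongside the other tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese")
tokenizer.save_pretrained("./randeng-t5-with-tokenizer-json")
```

The resulting directory then contains `tokenizer.json`, which can be uploaded next to the unmodified model weights.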
## How to Use
You can load this model and its tokenizer using the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON" # Replace with your actual repository name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
text = "你好,这是一个测试。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
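To confirm that the bundled `tokenizer.json` is actually picked up, you can check that `transformers` returned a fast (Rust-backed) tokenizer — a quick sanity check, reusing the placeholder repository name from above:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A fast tokenizer indicates the serialized tokenizer.json was loaded.
assert isinstance(tokenizer, PreTrainedTokenizerFast)
print(tokenizer.is_fast)
```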
For Rust users (and others requiring `tokenizer.json`):
```rust
use std::error::Error;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn Error>> {
    let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name

    // Requires the `http` feature of the `tokenizers` crate.
    // from_pretrained is synchronous; it downloads tokenizer.json from the Hub and loads it.
    let tokenizer = Tokenizer::from_pretrained(model_id, None)?;

    let text = "你好,这是一个中文文本。";
    let encoding = tokenizer.encode(text, true)?;
    println!("Original text: {}", text);
    println!("Tokens: {:?}", encoding.get_tokens());
    println!("IDs: {:?}", encoding.get_ids());

    let decoded_text = tokenizer.decode(encoding.get_ids(), true)?;
    println!("Decoded text: {}", decoded_text);
    Ok(())
}
```
## Original Model Information
For more details about the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model, its training, capabilities, and benchmarks, please refer to its official repository: [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese).