---
license: apache-2.0
language: zh
tags:
- transformer
- t5
- text2text-generation
- chinese
- multitask
- tokenizer
---

# Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON

This repository hosts a modified version of the [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese) model. Its primary purpose is to **include the `tokenizer.json` file**, which was missing from the original release.

## Motivation for this Repository

The original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model is an excellent T5-based model for various Chinese NLP tasks. However, it was released with only a `spiece.model` file for its tokenizer, lacking the `tokenizer.json` file.

While the Python `transformers` library can generally load the tokenizer from `spiece.model`, this absence caused issues for environments that strictly prefer or require `tokenizer.json` (e.g., certain versions or implementations of the Rust `tokenizers` library, or other frameworks that rely on this standardized format).

To enhance usability and compatibility across different platforms and libraries, this repository was created to provide the model with the commonly expected `tokenizer.json` file.

## Changes Made

The following modifications have been made to the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model files:

* **Added `tokenizer.json`:** The primary change is the inclusion of the `tokenizer.json` file, generated from the original `spiece.model` using the Python `transformers` library's `save_pretrained()` method (a minimal sketch of this conversion is shown after this list). This ensures broader compatibility and easier loading for various applications.
* **No Model Weight Changes:** **Crucially, the model weights (`pytorch_model.bin` or `model.safetensors`) themselves have not been altered in any way.** This repository provides the exact same powerful pre-trained model, just with an additional tokenizer serialization format.
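
For reference, the conversion described above can be reproduced roughly as follows. This is a minimal sketch, not the exact command used for this repository: it assumes the `sentencepiece` and `protobuf` packages are installed so `transformers` can build the fast tokenizer from `spiece.model`, and the output directory name is purely illustrative.

```python
from transformers import AutoTokenizer

# Loading the original repository with a fast tokenizer converts spiece.model
# into the `tokenizers` representation in memory.
tokenizer = AutoTokenizer.from_pretrained(
    "IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese", use_fast=True
)

# save_pretrained() then serializes that representation, writing tokenizer.json
# alongside the other tokenizer files (illustrative output directory).
tokenizer.save_pretrained("./randeng-t5-with-tokenizer-json")
```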

## How to Use

You can load this model and its tokenizer using the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "你好,这是一个测试。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
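
To confirm that the added `tokenizer.json` is usable on its own, you can load it directly with the standalone `tokenizers` library, bypassing the SentencePiece code path entirely. This is a minimal sketch, assuming the `huggingface_hub` and `tokenizers` packages are installed and using the same placeholder repository name as above:

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name

# Fetch tokenizer.json from the Hub and load it without any slow-tokenizer fallback.
tokenizer_path = hf_hub_download(repo_id=model_name, filename="tokenizer.json")
fast_tokenizer = Tokenizer.from_file(tokenizer_path)

encoding = fast_tokenizer.encode("你好,这是一个测试。")
print(encoding.tokens)
print(encoding.ids)
```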

For Rust users (and others requiring `tokenizer.json`):
```rust |
|
|
use tokenizers::Tokenizer; |
|
|
use std::error::Error; |
|
|
|
|
|
#[tokio::main] |
|
|
async fn main() -> Result<(), Box<dyn Error>> { |
|
|
let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name |
|
|
|
|
|
// The Tokenizer::from_pretrained will now find and use tokenizer.json |
|
|
let tokenizer = Tokenizer::from_pretrained(model_id, None).await?; |
|
|
|
|
|
let text = "你好,这是一个中文文本。"; |
|
|
let encoding = tokenizer.encode(text, true).unwrap(); |
|
|
|
|
|
println!("Original text: {}", text); |
|
|
println!("Tokens: {:?}", encoding.get_tokens()); |
|
|
println!("IDs: {:?}", encoding.get_ids()); |
|
|
|
|
|
let decoded_text = tokenizer.decode(encoding.get_ids(), true).unwrap(); |
|
|
println!("Decoded text: {}", decoded_text); |
|
|
|
|
|
Ok(()) |
|
|
} |
|
|
``` |

## Original Model Information

For more details about the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model, its training, capabilities, and benchmarks, please refer to its official repository: [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese).