---
license: apache-2.0
language: zh
tags:
- transformer
- t5
- text2text-generation
- chinese
- multitask
- tokenizer
---

# Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON

This repository hosts a modified version of the [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese) model. Its primary purpose is to **include the `tokenizer.json` file**, which was missing from the original release.

## Motivation for this Repository

The original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model is an excellent T5-based model for a range of Chinese NLP tasks. However, it was released with only a `spiece.model` file for its tokenizer, lacking the `tokenizer.json` file. While the Python `transformers` library can generally load the tokenizer from `spiece.model`, this absence caused issues in environments that strictly prefer or require `tokenizer.json` (e.g., certain versions or implementations of the Rust `tokenizers` library, or other frameworks that rely on this standardized format).

To improve usability and compatibility across platforms and libraries, this repository provides the model together with the commonly expected `tokenizer.json` file.

## Changes Made

The following modifications have been made to the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model files:

* **Added `tokenizer.json`:** The primary change is the inclusion of the `tokenizer.json` file, generated from the original `spiece.model` using the Python `transformers` library's `save_pretrained()` method. This ensures broader compatibility and easier loading for various applications.
* **No changes to model weights:** **Crucially, the model weights (`pytorch_model.bin` or `model.safetensors`) themselves have not been altered in any way.** This repository provides the exact same powerful pre-trained model, just with an additional tokenizer serialization format.
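The export step above can be sketched as follows. Loading the original checkpoint with a fast tokenizer converts `spiece.model` into the `tokenizers`-backed representation, and `save_pretrained()` then writes `tokenizer.json` alongside the other tokenizer files. The output directory name is an arbitrary placeholder:

```python
from transformers import AutoTokenizer

# Loading with use_fast=True converts the SentencePiece model
# (spiece.model) into a fast, tokenizer.json-backed tokenizer.
tok = AutoTokenizer.from_pretrained(
    "IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese", use_fast=True
)

# Saving a fast tokenizer writes tokenizer.json into the target
# directory, alongside spiece.model and the tokenizer configs.
tok.save_pretrained("./randeng-t5-with-tokenizer-json")  # placeholder path
```

The resulting `tokenizer.json` is what this repository adds; the exact commands used to produce the hosted file may have differed slightly, but the mechanism is the same.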
## How to Use

You can load this model and its tokenizer with the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "你好,这是一个测试。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For Rust users (and anyone else requiring `tokenizer.json`), note that `Tokenizer::from_pretrained` in the `tokenizers` crate is synchronous and requires the crate's `http` feature:

```rust
use std::error::Error;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn Error>> {
    let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name

    // Tokenizer::from_pretrained will now find and use tokenizer.json.
    let tokenizer = Tokenizer::from_pretrained(model_id, None)?;

    let text = "你好,这是一个中文文本。";
    let encoding = tokenizer.encode(text, true)?;

    println!("Original text: {}", text);
    println!("Tokens: {:?}", encoding.get_tokens());
    println!("IDs: {:?}", encoding.get_ids());

    let decoded_text = tokenizer.decode(encoding.get_ids(), true)?;
    println!("Decoded text: {}", decoded_text);

    Ok(())
}
```

## Original Model Information

For more details about the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model, its training, capabilities, and benchmarks, please refer to its official repository: [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese).