---
license: apache-2.0
language: zh
tags:
- transformer
- t5
- text2text-generation
- chinese
- multitask
- tokenizer
---

# Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON

This repository hosts a modified version of the [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese) model. Its primary purpose is to **include the `tokenizer.json` file**, which was missing from the original release.

## Motivation for this Repository

The original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model is an excellent T5-based model for various Chinese NLP tasks. However, it was released with only a `spiece.model` file for its tokenizer, lacking the `tokenizer.json` file.

While the Python `transformers` library can generally load the tokenizer from `spiece.model`, this absence caused issues for environments that strictly prefer or require `tokenizer.json` (e.g., certain versions or implementations of the Rust `tokenizers` library, or other frameworks that rely on this standardized format).

To enhance usability and compatibility across different platforms and libraries, this repository was created to provide the model with the commonly expected `tokenizer.json` file.

## Changes Made

The following modifications have been made to the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model files:

* **Added `tokenizer.json`:** The primary change is the inclusion of the `tokenizer.json` file, generated from the original `spiece.model` using the Python `transformers` library's `save_pretrained()` method (a minimal sketch of this conversion is shown after this list). This ensures broader compatibility and easier loading for various applications.
* **No Model Weight Changes:** **Crucially, the model weights (`pytorch_model.bin` or `model.safetensors`) themselves have not been altered in any way.** This repository provides the exact same powerful pre-trained model, just with an additional tokenizer serialization format.
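
For reference, the conversion described above can be reproduced roughly as follows. This is a minimal sketch, not the exact command used for this repository: it assumes the `sentencepiece` and `protobuf` packages are installed so `transformers` can build the fast tokenizer from `spiece.model`, and the output directory name is purely illustrative.

```python
from transformers import AutoTokenizer

# Loading the original repository with a fast tokenizer converts spiece.model
# into the `tokenizers` representation in memory.
tokenizer = AutoTokenizer.from_pretrained(
    "IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese", use_fast=True
)

# save_pretrained() then serializes that representation, writing tokenizer.json
# alongside the other tokenizer files (illustrative output directory).
tokenizer.save_pretrained("./randeng-t5-with-tokenizer-json")
```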

## How to Use

You can load this model and its tokenizer using the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "你好,这是一个测试。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
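
To confirm that the added `tokenizer.json` is usable on its own, you can load it directly with the standalone `tokenizers` library, bypassing the SentencePiece code path entirely. This is a minimal sketch, assuming the `huggingface_hub` and `tokenizers` packages are installed and using the same placeholder repository name as above:

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name

# Fetch tokenizer.json from the Hub and load it without any slow-tokenizer fallback.
tokenizer_path = hf_hub_download(repo_id=model_name, filename="tokenizer.json")
fast_tokenizer = Tokenizer.from_file(tokenizer_path)

encoding = fast_tokenizer.encode("你好,这是一个测试。")
print(encoding.tokens)
print(encoding.ids)
```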

For Rust users (and others requiring `tokenizer.json`):
```rust |
|
|
use tokenizers::Tokenizer; |
|
|
use std::error::Error; |
|
|
|
|
|
#[tokio::main] |
|
|
async fn main() -> Result<(), Box<dyn Error>> { |
|
|
let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name |
|
|
|
|
|
// The Tokenizer::from_pretrained will now find and use tokenizer.json |
|
|
let tokenizer = Tokenizer::from_pretrained(model_id, None).await?; |
|
|
|
|
|
let text = "你好,这是一个中文文本。"; |
|
|
let encoding = tokenizer.encode(text, true).unwrap(); |
|
|
|
|
|
println!("Original text: {}", text); |
|
|
println!("Tokens: {:?}", encoding.get_tokens()); |
|
|
println!("IDs: {:?}", encoding.get_ids()); |
|
|
|
|
|
let decoded_text = tokenizer.decode(encoding.get_ids(), true).unwrap(); |
|
|
println!("Decoded text: {}", decoded_text); |
|
|
|
|
|
Ok(()) |
|
|
} |
|
|
``` |

## Original Model Information

For more details about the original `IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese` model, its training, capabilities, and benchmarks, please refer to its official repository: [IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese).