lbourdois committed
Commit fd91d95 · verified · 1 Parent(s): 8297720

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +99 -87
README.md CHANGED
@@ -1,88 +1,100 @@
- ---
- license: apache-2.0
- datasets:
- - TigerResearch/pretrain_zh
- language:
- - zh
- base_model:
- - Qwen/Qwen2.5-3B
- tags:
- - qwen2.5
- - text-generation-inference
- - Text Generation
- - Character
- ---
-
- **Qwen2.5-3B-Character**
-
- **Introduction:**
-
- **Qwen2.5-3B-Character** is a character-level version of the [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) model. It is developed from Qwen2.5-3B and is specifically designed for character-to-character transformation and generation tasks.
-
- **Core Contributions:**
-
- 1. **Modified Token Vocabulary:** The original model's token vocabulary has been revised to remove tokens representing phrases and multiple characters. This refinement sharpens the model's focus on individual-character processing.
-
- 2. **Continued Pre-training:** Based on the modified vocabulary, the model has undergone further pre-training to optimize its performance and adaptability for character-level tasks.
-
-
- **Training Dataset:**
-
- The model has been trained on the `TigerResearch/pretrain_zh` dataset, a comprehensive Chinese pre-training dataset provided by **TigerResearch**. For more information about the dataset, please visit: [TigerResearch/pretrain_zh](https://huggingface.co/datasets/TigerResearch/pretrain_zh).
-
-
- **Training Code:**
-
- The training process was carried out with **LLaMA-Factory**, an open-source project that provides tools and frameworks for training language models. The LLaMA-Factory codebase is available at: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
-
-
- **Results**
-
- To assess the efficacy of Qwen2.5-3B-Character, we evaluated its performance on three widely used benchmarks: C-Eval, CMMLU, and MMLU. The results are tabulated as follows:
-
- | Model | C-Eval | CMMLU | MMLU |
- | :--- | :---: | :---: | :---: |
- | Qwen2.5-3B | 74.37 | 74.94 | 65.87 |
- | Qwen2.5-3B-filter | 70.43 | 69.69 | 65.53 |
- | Qwen2.5-3B-Character | 71.97 | 71.94 | 65.18 |
-
- For a clearer comparison, the table also reports results for the original Qwen2.5-3B and the token-modified Qwen2.5-3B (Qwen2.5-3B-filter).
-
-
- **Quickstart**
-
- The latest version of `transformers` is recommended (at least 4.37.0). The following code snippet shows how to use the model for chat-style generation with `transformers`:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
-
- model_name = 'Henry94/Qwen2.5-3B-Character'
-
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
-
- # Prompt: "Please give a brief introduction to large language models."
- prompt = "请简单介绍一下大型语言模型."
- messages = [
-     {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
-     {"role": "user", "content": prompt}
- ]
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=512
- )
- # Strip the prompt tokens so only the newly generated text is decoded
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
-
- print(response)
  ```
 
+ ---
+ license: apache-2.0
+ datasets:
+ - TigerResearch/pretrain_zh
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-3B
+ tags:
+ - qwen2.5
+ - text-generation-inference
+ - Text Generation
+ - Character
+ ---
+
+ **Qwen2.5-3B-Character**
+
+ **Introduction:**
+
+ **Qwen2.5-3B-Character** is a character-level version of the [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) model. It is developed from Qwen2.5-3B and is specifically designed for character-to-character transformation and generation tasks.
+
+ **Core Contributions:**
+
+ 1. **Modified Token Vocabulary:** The original model's token vocabulary has been revised to remove tokens representing phrases and multiple characters. This refinement sharpens the model's focus on individual-character processing.
+
+ 2. **Continued Pre-training:** Based on the modified vocabulary, the model has undergone further pre-training to optimize its performance and adaptability for character-level tasks.
+
+
+ **Training Dataset:**
+
+ The model has been trained on the `TigerResearch/pretrain_zh` dataset, a comprehensive Chinese pre-training dataset provided by **TigerResearch**. For more information about the dataset, please visit: [TigerResearch/pretrain_zh](https://huggingface.co/datasets/TigerResearch/pretrain_zh).
+
+
+ **Training Code:**
+
+ The training process was carried out with **LLaMA-Factory**, an open-source project that provides tools and frameworks for training language models. The LLaMA-Factory codebase is available at: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
+
+
+ **Results**
+
+ To assess the efficacy of Qwen2.5-3B-Character, we evaluated its performance on three widely used benchmarks: C-Eval, CMMLU, and MMLU. The results are tabulated as follows:
+
+ | Model | C-Eval | CMMLU | MMLU |
+ | :--- | :---: | :---: | :---: |
+ | Qwen2.5-3B | 74.37 | 74.94 | 65.87 |
+ | Qwen2.5-3B-filter | 70.43 | 69.69 | 65.53 |
+ | Qwen2.5-3B-Character | 71.97 | 71.94 | 65.18 |
+
+ For a clearer comparison, the table also reports results for the original Qwen2.5-3B and the token-modified Qwen2.5-3B (Qwen2.5-3B-filter).
+
+
+ **Quickstart**
+
+ The latest version of `transformers` is recommended (at least 4.37.0). The following code snippet shows how to use the model for chat-style generation with `transformers`:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+
+ model_name = 'Henry94/Qwen2.5-3B-Character'
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
+
+ # Prompt: "Please give a brief introduction to large language models."
+ prompt = "请简单介绍一下大型语言模型."
+ messages = [
+     {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=512
+ )
+ # Strip the prompt tokens so only the newly generated text is decoded
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+
+ print(response)
  ```
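
As an illustration of the "Modified Token Vocabulary" step described in the README above, the sketch below shows one way multi-character tokens could be identified and suppressed with a Qwen2.5 tokenizer. It is a minimal sketch under stated assumptions, not the authors' actual procedure: the `is_single_char` heuristic and the `bad_words_ids` suggestion are illustrative choices added here.

```python
from transformers import AutoTokenizer

# Starting point for the illustration: the original Qwen2.5-3B tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

def is_single_char(token_id: int) -> bool:
    """Rough heuristic: a token counts as 'single-character' if it decodes to
    at most one visible character (whitespace and partial UTF-8 bytes pass)."""
    return len(tokenizer.decode([token_id]).strip()) <= 1

special_ids = set(tokenizer.all_special_ids)
multi_char_ids = [
    i for i in range(len(tokenizer))
    if i not in special_ids and not is_single_char(i)
]
print(f"{len(multi_char_ids)} of {len(tokenizer)} tokens decode to multiple characters")

# A non-invasive way to 'remove' these tokens is to forbid them at generation time:
#   model.generate(**model_inputs, bad_words_ids=[[i] for i in multi_char_ids], ...)
# The released model instead revises the vocabulary itself and continues pre-training
# on TigerResearch/pretrain_zh, as described in the README.
```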
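The quickstart snippet imports `TextStreamer` but never uses it. If streaming output is wanted, the sketch below shows a typical `transformers` pattern; the model name and prompt are reused from the quickstart, while the streaming setup itself is an assumption added for illustration rather than part of the original card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = 'Henry94/Qwen2.5-3B-Character'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "请简单介绍一下大型语言模型."},  # "Please briefly introduce large language models."
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# TextStreamer prints tokens to stdout as they are generated instead of waiting
# for the full sequence; skip_prompt avoids echoing the input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```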