Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +129 -117
README.md CHANGED
@@ -1,118 +1,130 @@
- ---
- license: apache-2.0
- datasets:
- - shellwork/ChatParts_Dataset
- language:
- - en
- base_model:
- - Qwen/Qwen2.5-14B-Instruct
- pipeline_tag: question-answering
- tags:
- - biology
- - medical
- ---
-
- # shellwork/ChatParts-qwen2.5-14b
-
- 🤖 [XJTLU-Software RAG GitHub Repository](https://github.com/shellwork/XJTLU-Software-RAG/tree/main) • 📊 [ChatParts Dataset](https://huggingface.co/datasets/shellwork/ChatParts_Dataset)
-
- **shellwork/ChatParts-qwen2.5-14b** is a specialized dialogue model fine-tuned from **Qwen2.5-14B-Instruct** by the XJTLU-Software iGEM Competition team. This model is tailored for the synthetic biology domain, aiming to assist competition participants and researchers in efficiently collecting and organizing relevant information. It serves as the local model component of the XJTLU-developed Retrieval-Augmented Generation (RAG) software, enhancing search and summarization capabilities within synthetic biology data.
-
- ## 📚 Dataset Information
-
- The model is trained on a comprehensive synthetic biology-specific dataset curated from multiple authoritative sources:
-
- - **iGEM Wiki Pages (2004-2023):** Comprehensive coverage of synthetic biology topics from over two decades of iGEM competitions.
- - **Synthetic Biology Review Papers:** More than 1,000 high-quality review articles providing in-depth insights into various aspects of synthetic biology.
- - **iGEM Parts Registry Documentation:** Detailed documentation of parts used in iGEM projects, facilitating accurate information retrieval.
-
- In total, the dataset comprises over **200,000 question-answer pairs**, meticulously assembled to cover a wide spectrum of synthetic biology topics. For more detailed information about the dataset, please visit our [training data repository](https://huggingface.co/datasets/shellwork/ChatParts_Dataset).
-
- ## 🛠️ How to Use
-
- This repository supports usage with the `transformers` library. Below is a straightforward example of how to deploy the **shellwork/ChatParts-qwen2.5-14b** model using `transformers`.
-
- ### 📋 Requirements
-
- - **Transformers Library:** Ensure you have `transformers` version **>= 4.43.0** installed. You can update your installation using:
-
- ```bash
- pip install --upgrade transformers
- ```
-
- ### ⚙️ Example: Deploying with Transformers
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- # Load the tokenizer and model
- model_name = "shellwork/ChatParts-qwen2.5-14b"
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     torch_dtype="auto",
-     device_map="auto"
- )
- tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- # Define the prompt and messages
- prompt = "Give me a short introduction to synthetic biology."
- messages = [
-     {"role": "system", "content": "You are ChatParts, a model specialized in synthetic biology created by XJTLU-Software."},
-     {"role": "user", "content": prompt}
- ]
-
- # Apply chat template
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
-
- # Tokenize the input
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-
- # Generate the response
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=512
- )
-
- # Extract the generated tokens
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- # Decode the response
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
- print(response)
- ```
-
- ### 🔍 Explanation
-
- 1. **Import Libraries:** Import the necessary libraries including `torch`, `modelscope`, and `transformers`.
-
- 2. **Load Model and Tokenizer:** Use `AutoModelForCausalLM` and `AutoTokenizer` from `modelscope` to load the pre-trained model and tokenizer.
-
- 3. **Define Prompt and Messages:** Create a prompt and define the conversation messages, including system and user roles.
-
- 4. **Apply Chat Template:** Utilize the `apply_chat_template` method to format the messages appropriately for the model.
-
- 5. **Tokenize Input:** Tokenize the formatted text and move it to the appropriate device (CPU/GPU).
-
- 6. **Generate Response:** Use the `generate` method to produce a response with a specified maximum number of new tokens.
-
- 7. **Decode and Print:** Decode the generated tokens to obtain the final text response and print it.
-
- ## 📄 License
-
- This model is released under the **Apache License 2.0**. For more details, please refer to the [license information](https://github.com/shellwork/XJTLU-Software-RAG/tree/main) in the repository.
-
- ## 🔗 Additional Resources
-
- - **RAG Software:** Explore the full capabilities of our Retrieval-Augmented Generation software [here](https://github.com/shellwork/XJTLU-Software-RAG/tree/main).
- - **Training Data:** Access and review the extensive training dataset [here](https://huggingface.co/datasets/shellwork/ChatParts_Dataset).
- - **Support & Contributions:** For support or to contribute to the project, visit our [GitHub Issues](https://github.com/shellwork/XJTLU-Software-RAG/issues) page.
-
-
 
+ ---
+ license: apache-2.0
+ datasets:
+ - shellwork/ChatParts_Dataset
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-14B-Instruct
+ pipeline_tag: question-answering
+ tags:
+ - biology
+ - medical
+ ---
+
+ # shellwork/ChatParts-qwen2.5-14b
+
+ 🤖 [XJTLU-Software RAG GitHub Repository](https://github.com/shellwork/XJTLU-Software-RAG/tree/main) • 📊 [ChatParts Dataset](https://huggingface.co/datasets/shellwork/ChatParts_Dataset)
+
+ **shellwork/ChatParts-qwen2.5-14b** is a specialized dialogue model fine-tuned from **Qwen2.5-14B-Instruct** by the XJTLU-Software iGEM Competition team. This model is tailored for the synthetic biology domain, aiming to assist competition participants and researchers in efficiently collecting and organizing relevant information. It serves as the local model component of the XJTLU-developed Retrieval-Augmented Generation (RAG) software, enhancing search and summarization capabilities within synthetic biology data.
+
+ ## 📚 Dataset Information
+
+ The model is trained on a comprehensive synthetic biology-specific dataset curated from multiple authoritative sources:
+
+ - **iGEM Wiki Pages (2004-2023):** Comprehensive coverage of synthetic biology topics from over two decades of iGEM competitions.
+ - **Synthetic Biology Review Papers:** More than 1,000 high-quality review articles providing in-depth insights into various aspects of synthetic biology.
+ - **iGEM Parts Registry Documentation:** Detailed documentation of parts used in iGEM projects, facilitating accurate information retrieval.
+
+ In total, the dataset comprises over **200,000 question-answer pairs**, meticulously assembled to cover a wide spectrum of synthetic biology topics. For more details about the dataset, please visit our [training data repository](https://huggingface.co/datasets/shellwork/ChatParts_Dataset).
+
+ ## 🛠️ How to Use
+
+ This repository supports usage with the `transformers` library. Below is a straightforward example of how to deploy the **shellwork/ChatParts-qwen2.5-14b** model using `transformers`.
+
+ ### 📋 Requirements
+
+ - **Transformers Library:** Ensure you have `transformers` version **>= 4.43.0** installed. You can update your installation using:
+
+ ```bash
+ pip install --upgrade transformers
+ ```
+
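To confirm the installed version meets the requirement before running the example, a small illustrative helper (the `meets_minimum` function is ours, not part of `transformers`; it compares plain dotted release numbers and does not handle pre-release suffixes):

```python
def meets_minimum(installed: str, required: str = "4.43.0") -> bool:
    """Compare dotted release numbers numerically, e.g. "4.44.2" >= "4.43.0"."""
    as_tuple = lambda v: tuple(int(part) for part in v.split(".")[:3])
    return as_tuple(installed) >= as_tuple(required)

print(meets_minimum("4.44.2"))  # True
print(meets_minimum("4.42.0"))  # False
```

In practice the installed version string is available as `transformers.__version__`.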
+ ### ⚙️ Example: Deploying with Transformers
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the tokenizer and model
+ model_name = "shellwork/ChatParts-qwen2.5-14b"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ # Define the prompt and messages
+ prompt = "Give me a short introduction to synthetic biology."
+ messages = [
+     {"role": "system", "content": "You are ChatParts, a model specialized in synthetic biology created by XJTLU-Software."},
+     {"role": "user", "content": prompt}
+ ]
+
+ # Apply chat template
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Tokenize the input
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Generate the response
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=512
+ )
+
+ # Extract the generated tokens
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ # Decode the response
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
+
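One step in the snippet above is easy to miss: `model.generate` returns each prompt's tokens followed by the newly generated tokens, so the list comprehension slices every output at its prompt length. A minimal pure-Python illustration of that slicing, using made-up token IDs:

```python
# Made-up token IDs standing in for real tokenizer output.
input_ids = [[101, 7592, 102]]              # one prompt in the batch
generated = [[101, 7592, 102, 2023, 2003]]  # generate() echoes the prompt, then appends new tokens

# Same slicing as in the example above: drop the prompt prefix from each output.
new_tokens = [out[len(inp):] for inp, out in zip(input_ids, generated)]
print(new_tokens)  # [[2023, 2003]]
```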
+ ### 🔍 Explanation
+
+ 1. **Import Libraries:** Import the necessary libraries: `torch` and `transformers`.
+
+ 2. **Load Model and Tokenizer:** Use `AutoModelForCausalLM` and `AutoTokenizer` from `transformers` to load the pre-trained model and tokenizer.
+
+ 3. **Define Prompt and Messages:** Create a prompt and define the conversation messages, including system and user roles.
+
+ 4. **Apply Chat Template:** Utilize the `apply_chat_template` method to format the messages appropriately for the model.
+
+ 5. **Tokenize Input:** Tokenize the formatted text and move it to the appropriate device (CPU/GPU).
+
+ 6. **Generate Response:** Use the `generate` method to produce a response with a specified maximum number of new tokens.
+
+ 7. **Decode and Print:** Decode the generated tokens to obtain the final text response and print it.
+
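For reference on step 4: Qwen2.5's chat template follows the ChatML convention (`<|im_start|>role … <|im_end|>` turns, with a trailing `<|im_start|>assistant` when `add_generation_prompt=True`). A hand-rolled sketch of roughly what the template produces — illustrative only; always use the tokenizer's own `apply_chat_template` in practice:

```python
def chatml_format(messages):
    """Approximate the ChatML layout used by Qwen2.5-style chat templates (sketch only)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt: the model continues from here
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are ChatParts, a model specialized in synthetic biology created by XJTLU-Software."},
    {"role": "user", "content": "Give me a short introduction to synthetic biology."},
]
print(chatml_format(messages))
```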
+ ## 📄 License
+
+ This model is released under the **Apache License 2.0**. For more details, please refer to the [license information](https://github.com/shellwork/XJTLU-Software-RAG/tree/main) in the repository.
+
+ ## 🔗 Additional Resources
+
+ - **RAG Software:** Explore the full capabilities of our Retrieval-Augmented Generation software [here](https://github.com/shellwork/XJTLU-Software-RAG/tree/main).
+ - **Training Data:** Access and review the extensive training dataset [here](https://huggingface.co/datasets/shellwork/ChatParts_Dataset).
+ - **Support & Contributions:** For support or to contribute to the project, visit our [GitHub Issues](https://github.com/shellwork/XJTLU-Software-RAG/issues) page.
+
  Feel free to reach out through our GitHub repository for any questions, issues, or contributions related to **shellwork/ChatParts-qwen2.5-14b**.