Update README.md
5. Training Process

The model underwent a three-phase training approach (all three phases used LoRA):

- Supervised Fine-tuning: Using the ChatGLM2-6B foundational model, MindGLM was fine-tuned with a dedicated dataset for psychological counseling.

- Reward Model Training: A reward model was trained to evaluate and score the responses of the fine-tuned model.

- Reinforcement Learning: The model was further aligned using the PPO (Proximal Policy Optimization) algorithm to ensure its responses align with human preferences.
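Since every phase above trains LoRA adapters rather than the full model, the core idea is worth a sketch: the pretrained weight `W` stays frozen, and only a low-rank update `B @ A` is learned. This is a minimal illustration with toy dimensions and NumPy, not MindGLM's actual implementation (the real ChatGLM2-6B layers are far larger, and the variable names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only; real ChatGLM2-6B layers are far larger.
d_out, d_in, rank = 8, 8, 2

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight (not updated)
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # zero-initialized, so the adapter starts as a no-op

def lora_forward(x, scale=1.0):
    # Effective weight is W + scale * (B @ A); only A and B receive gradients.
    return x @ (W + scale * (B @ A)).T

x = rng.normal(size=(4, d_in))
# Before any training, B @ A == 0, so the adapted layer matches the frozen one.
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable parameters shrink from d_out * d_in to rank * (d_in + d_out).
full_params = d_out * d_in          # 64
lora_params = rank * (d_in + d_out) # 32
```

Initializing `B` to zero is the standard LoRA choice: it guarantees training starts from exactly the pretrained behavior, and the parameter saving grows with layer size (for a 4096-by-4096 layer at rank 8, adapters are well under 1% of the full weight count).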
6. Limitations

While MindGLM is a powerful tool, users should be aware of its limitations: