Commit df7ceac (verified) by ViperEk · 1 parent: 232a7e3

Update README.md

  1. README.md +0 -329
README.md CHANGED
@@ -7,332 +7,3 @@ language:
  - zh
  license: apache-2.0
  ---
-
-
- <div style="display: flex; flex-direction: column; align-items: center; justify-content: center; text-align: center; font-size: 16px; font-weight: bold; margin-top: 50px;">
-
- <div>
- <a href="#english" style="text-decoration: none; margin: 0 10px; color: blue;">English</a> |
- <a href="#chinese" style="text-decoration: none; margin: 0 10px; color: blue;">中文</a>
- </div>
-
- <h1 style="margin: 20px 0 0 0; font-size: 2.5em; font-weight: bold;">KHAOSZ </h1>
- </div>
-
- <h2 id="english">English Version</h2>
-
- This is a bilingual Transformer model that supports both Chinese and English. The repository contains the model configuration and the training workflow; training loads the parameters defined in `params/config.json`. The training script `train.py` parses command-line arguments, including the dataset root directory, number of training epochs, batch size, checkpoint interval, and checkpoint directory.
-
- **Model Download Options (Choose One):**
-
- 1. Visit [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) and open **Files and versions**
- 2. Run `params/download.py` to download the parameters
-
- **Demo Video:** [bilibili](https://www.bilibili.com/video/BV1z5RPYHEkd)
-
- Training dataset sources are listed in the **Model Card** section of the HuggingFace page linked above.
-
- **License:** The code is released under the Apache-2.0 license. Please credit the source when you use it.
-
- - **📊 Device Selection:** The code trains on CUDA by default.
- - **🌐 Performance Optimization:** `dtype=torch.bfloat16` is enabled to accelerate training and reduce memory usage. Make sure your hardware supports bfloat16.
- - **🤖 Language Support:** The model supports training on Chinese and English. The BBPE tokenizer was trained only on Chinese and English text, so out-of-vocabulary (OOV) issues are minimal for these two languages but may occur for others.
-
- ### 📌 Training Guide
-
- To train this Transformer model, follow these steps:
-
- **(1). Prepare Dataset:**
-
- Place the datasets in the designated root directory. The source files should be text documents in Chinese, English, or a mix of both. The format should match the model's input requirements, preferably pre-tokenized `token_ids` stored as a `torch.Tensor`. A tensor with a fixed integer dtype is more compact than a Python list, which keeps every ID as a separate integer object behind a pointer (8 bytes on 64-bit builds).
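
To make the memory argument concrete, here is a small arithmetic sketch of how many bytes a token-ID array needs at different integer widths. The helper names (`smallest_dtype`, `storage_bytes`) are hypothetical, the dtype names only mirror the usual tensor dtypes, and nothing here reflects how the project's own data loader reads files; whether it accepts anything narrower than 64-bit IDs is an assumption you would need to check.

```python
# Illustrative arithmetic only; this sketch does not import torch.
# Bytes used per token ID for common signed integer widths.
BYTES_PER_ID = {"int16": 2, "int32": 4, "int64": 8}


def smallest_dtype(vocab_size: int) -> str:
    """Return the narrowest signed dtype that can represent every token ID."""
    max_id = vocab_size - 1
    for name, n_bytes in BYTES_PER_ID.items():
        # Largest value a signed integer of this width can hold.
        if max_id <= 2 ** (8 * n_bytes - 1) - 1:
            return name
    raise ValueError("vocab_size does not fit in int64")


def storage_bytes(n_tokens: int, dtype: str) -> int:
    """Raw payload size of n_tokens IDs stored contiguously as `dtype`."""
    return n_tokens * BYTES_PER_ID[dtype]


if __name__ == "__main__":
    n_tokens = 1_000_000  # hypothetical corpus size
    for dtype in BYTES_PER_ID:
        print(dtype, storage_bytes(n_tokens, dtype), "bytes")
```

If IDs are stored in a narrow dtype, they may need to be cast to a wider integer type before the embedding lookup.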
-
- **(2). Install Dependencies:**
-
- ```bash
- pip install -r requirements.txt
- pip install .
- ```
-
- **(3). Run Training Script:**
-
- ```bash
- # --train_type takes one of: seq, sft, dpo
- python train.py \
-     --train_type=seq \
-     --data_root_path=/path/to/dataset \
-     --n_epoch=5 \
-     --batch_size=8 \
-     --max_lr=2e-4 \
-     --n_iter_ckpt=10000 \
-     --ckpt_dir checkpoints
- ```
-
- **Parameters Explanation:**
- - `--train_type`: Training type (`seq`, `sft`, or `dpo`)
- - `--data_root_path`: Dataset root directory
- - `--n_epoch`: Total number of training epochs
- - `--batch_size`: Batch size
- - `--n_iter_step`: Number of batches per training step
- - `--warning_step`: Number of warmup steps
- - `--max_lr`: Maximum learning rate (the schedule uses warmup followed by cosine decay)
- - `--n_iter_ckpt`: Checkpoint saving interval
- - `--ckpt_dir`: Checkpoint directory
- - `--resume_dir`: Checkpoint path to resume training from
-
- Training logs are saved to `train_log.txt`. Checkpoints are stored in the specified directory and can be used to resume training or run evaluation.
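
The `--max_lr` entry mentions warmup plus cosine decay. The sketch below shows one common form of that schedule so the interaction between the warmup step count and the peak learning rate is easier to picture. It is not taken from `train.py`: the function name `lr_at_step`, the `total_steps` argument, the linear warmup shape, and the `min_lr` floor are all assumptions, and the repository's actual schedule may differ.

```python
import math


def lr_at_step(step: int, max_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    # Cosine decay from max_lr down to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical run: 100 warmup steps, 1000 steps in total, peak 2e-4.
    for s in (0, 50, 99, 550, 1000):
        print(s, lr_at_step(s, max_lr=2e-4, warmup_steps=100, total_steps=1000))
```

With these hypothetical values the rate climbs to `2e-4` by the end of warmup and falls back toward `min_lr` by the final step.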
-
- ### 👉 Usage Guide
-
- **(1). Chatting with the Model:**
-
- Run `chat.py`, or call the streaming or non-streaming interface directly:
-
- **Streaming Output:**
- ```python
- import torch
- from khaosz import Khaosz
-
- model_dir = "your_model_parameter_dir"
- model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
- history = []
-
- while True:
-     query = input(">> ")
-     if query == "!exit":
-         break
-
-     # Each iteration yields the response so far plus the updated history,
-     # so print only the part that has not been shown yet.
-     response_size = 0
-     for response, history in model.stream_generate(
-         query=query,
-         history=history,
-         temperature=0.85,
-         top_p=0.95,
-         top_k=50
-     ):
-         print(response[response_size:], end="", flush=True)
-         response_size = len(response)
-     print()  # end the line once the response is complete
- ```
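
The streaming loop above slices `response[response_size:]` because each iteration receives the whole response generated so far, not just the newest piece. The sketch below reproduces that pattern without a model. `fake_stream_generate` is a hypothetical stand-in, and the history format it uses (a list of query/response pairs) is an assumption, not the library's documented structure.

```python
from typing import Iterator, List, Tuple

History = List[Tuple[str, str]]


def fake_stream_generate(query: str, history: History) -> Iterator[Tuple[str, History]]:
    """Hypothetical stand-in that yields a growing response, one character at a time."""
    reply = f"You said: {query}"
    partial = ""
    for ch in reply:
        partial += ch
        # Yield the cumulative text and a history that includes this turn.
        yield partial, history + [(query, partial)]


def collect_deltas(stream: Iterator[Tuple[str, History]]) -> Tuple[List[str], History]:
    """Split a cumulative stream into the new suffixes seen at each step."""
    deltas: List[str] = []
    shown = 0
    history: History = []
    for response, history in stream:
        deltas.append(response[shown:])  # only the unseen tail
        shown = len(response)
    return deltas, history


if __name__ == "__main__":
    deltas, history = collect_deltas(fake_stream_generate("hi", []))
    print("".join(deltas))
```

Joining the deltas reproduces the final response exactly, which is why printing them with `end=""` gives a typewriter effect without repeating text.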
-
- **Non-streaming Output:**
- ```python
- import torch
- from khaosz import Khaosz
-
- model_dir = "your_model_parameter_dir"
- model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
- history = []
-
- while True:
-     query = input(">> ")
-     if query == "!exit":
-         break
-
-     # generate() returns the complete response in a single call.
-     response = model.generate(
-         query=query,
-         history=history,
-         temperature=0.85,
-         top_p=0.95,
-         top_k=50
-     )
-     print(response)
- ```
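
Both chat examples pass `temperature`, `top_k`, and `top_p`. As a rough guide to what these knobs usually do, the sketch below applies the three filters to a plain list of logits. It is a generic illustration in pure Python, not the sampler used inside `khaosz`; the function names are hypothetical and the library may order or implement the steps differently.

```python
import math
import random


def filter_candidates(logits, temperature=0.85, top_k=50, top_p=0.95):
    """Return (token_ids, probs) that survive temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def sample_next_token(logits, rng=random, **kwargs):
    """Draw one token ID from the filtered candidates, weighted by probability."""
    kept, weights = filter_candidates(logits, **kwargs)
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(sample_next_token([2.0, 1.0, 0.0, -1.0], temperature=0.85, top_k=3, top_p=0.95))
```

Lower `temperature`, smaller `top_k`, or smaller `top_p` all narrow the candidate set, which is consistent with the more conservative values used in the RAG example.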
-
- **(2). Retrieval-Augmented Generation (RAG):**
-
- ```python
- import torch
- from khaosz import Khaosz
-
- model_dir = "your_model_parameter_dir"
- model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
-
- query = input(">> ")  # question to answer with retrieved context
-
- retrieved_content = model.retrieve_generate(
-     query=query,
-     retrieve_top_k=5,
-     temperature=0.6,
-     top_k=30,
-     top_p=0.95
- )
- print(retrieved_content)
- ```
-
- ### 📌 Model Specifications
-
- This model is a 24-layer Transformer whose hyperparameters are defined in `config.json`, with approximately 1.0 billion (1.0B) parameters in total.
-
- **Key Design Choices:**
- - Weight tying between the embedding layer and the final linear layer (a common way for small models to save parameters)
- - Embedding layer optimization: without weight tying, a 10,000-word vocabulary would consume ~102M parameters (0.1B)
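
The saving from weight tying is the size of one vocabulary-by-hidden-size matrix: tied models store it once, untied models store it twice. The sketch below only encodes that product. The numbers in it are placeholders chosen so the result lands near the ~102M figure quoted above; they are not the model's real configuration, so substitute the vocabulary size and hidden size from `config.json`.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one vocab_size x d_model matrix (embedding or output projection)."""
    return vocab_size * d_model


def untied_overhead(vocab_size: int, d_model: int) -> int:
    """Extra parameters when the output projection does not share the embedding matrix."""
    return embedding_params(vocab_size, d_model)


if __name__ == "__main__":
    # Placeholder values, not read from config.json.
    vocab_size = 10_000
    d_model = 10_240
    print(f"{untied_overhead(vocab_size, d_model):,} extra parameters without tying")
```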
-
- **Limitations:**
- - May struggle with complex language phenomena because of its small parameter count
- - Prone to overfitting on specialized datasets
- - Limited multilingual capabilities
-
- **Advantages:**
- - Runs efficiently on lower-spec hardware
- - Trains faster than larger models
-
- **Training Pipeline:**
- The model has completed pre-training, SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization). All corresponding training code is included in the repository.
-
-
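- <h2 id="chinese">Chinese Version</h2>
- This is a bilingual Transformer model that supports both Chinese and English. The repository contains the configuration file and the training workflow; training loads the parameters defined in `params/config.json`. The training script `train.py` parses command-line arguments, including the dataset root directory, number of training epochs, batch size, checkpoint interval, and checkpoint directory.
-
- **Model Download Options (Choose One):**
-
- 1. Visit [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) and open **Files and versions**
- 2. Run `params/download.py` to download the model parameters
-
- **Demo Video:** [bilibili](https://www.bilibili.com/video/BV1z5RPYHEkd)
-
- For the training data sources, see the **Model Card** section of the HuggingFace download page.
-
- **License:** The code is released under the Apache-2.0 license. Please credit the source when you use it.
-
- - **📊 Device Selection:** Training uses CUDA by default.
- - **🌐 Performance Optimization:** `dtype=torch.bfloat16` is enabled to accelerate training and reduce memory usage. Make sure your hardware supports it.
- - **🤖 Language Support:** The model supports training on Chinese and English. The BBPE tokenizer was trained only on Chinese and English text, so OOV (out-of-vocabulary) issues are rare for these two languages but may occur for others.
-
-
-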
- ### 📌 Training Guide
-
- To train this Transformer model, follow these steps:
-
- #### **(1). Prepare Dataset:**
-
- Place the datasets in the designated root directory. The files should be text documents containing Chinese, English, or mixed text. The format should match the model's input requirements; pre-tokenized `token_ids` saved as a `torch.Tensor` are recommended. A tensor with a fixed integer dtype is more compact than a Python list, which keeps every ID as a separate integer object behind a pointer (8 bytes on 64-bit builds).
-
- #### **(2). Install Dependencies:**
-
- ```bash
- pip install -r requirements.txt
- pip install .
- ```
-
- #### **(3). Run Training Script:**
-
- ```bash
- # --train_type takes one of: seq, sft, dpo
- python train.py \
-     --train_type=seq \
-     --data_root_path=/path/to/dataset \
-     --n_epoch=5 \
-     --batch_size=8 \
-     --max_lr=2e-4 \
-     --n_iter_ckpt=10000 \
-     --ckpt_dir checkpoints
- ```
-
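- **Parameters Explanation:**
- - `--train_type`: Training type (`seq`, `sft`, or `dpo`)
- - `--data_root_path`: Dataset root directory
- - `--n_epoch`: Total number of training epochs
- - `--batch_size`: Batch size
- - `--n_iter_step`: Number of batches per training step
- - `--warning_step`: Number of warmup steps
- - `--max_lr`: Maximum learning rate (the schedule uses warmup followed by cosine decay)
- - `--n_iter_ckpt`: Checkpoint saving interval
- - `--ckpt_dir`: Checkpoint directory
- - `--resume_dir`: Path to resume training from
-
- Training logs are saved to `train_log.txt`. Checkpoints are stored in the specified directory and can be used to resume training or run evaluation.
-
-
-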
- ### 👉 Usage Guide
-
- #### **(1). Chatting with the Model:**
-
- Run `chat.py`, or call the streaming or non-streaming interface directly:
-
- **Streaming Output:**
- ```python
- import torch
- from khaosz import Khaosz
-
- model_dir = "your_model_parameter_dir"
- model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
- history = []
-
- while True:
-     query = input(">> ")
-     if query == "!exit":
-         break
-
-     # Each iteration yields the response so far plus the updated history,
-     # so print only the part that has not been shown yet.
-     response_size = 0
-     for response, history in model.stream_generate(
-         query=query,
-         history=history,
-         temperature=0.85,
-         top_p=0.95,
-         top_k=50
-     ):
-         print(response[response_size:], end="", flush=True)
-         response_size = len(response)
-     print()  # end the line once the response is complete
- ```
-
- **Non-streaming Output:**
- ```python
- import torch
- from khaosz import Khaosz
-
- model_dir = "your_model_parameter_dir"
- model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
- history = []
-
- while True:
-     query = input(">> ")
-     if query == "!exit":
-         break
-
-     # generate() returns the complete response in a single call.
-     response = model.generate(
-         query=query,
-         history=history,
-         temperature=0.85,
-         top_p=0.95,
-         top_k=50
-     )
-     print(response)
- ```
-
- #### **(2). Retrieval-Augmented Generation (RAG):**
-
- ```python
- import torch
- from khaosz import Khaosz
-
- model_dir = "your_model_parameter_dir"
- model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
-
- query = input(">> ")  # question to answer with retrieved context
-
- retrieved_content = model.retrieve_generate(
-     query=query,
-     retrieve_top_k=5,
-     temperature=0.6,
-     top_k=30,
-     top_p=0.95
- )
- print(retrieved_content)
- ```
-
-
-
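- ### 📌 Model Specifications (Repeated Section)
-
- This model is based on a 24-layer Transformer architecture whose hyperparameters are defined in `config.json`, with approximately 1.0 billion (1.0B) parameters in total.
-
- **Key Design Choices:**
- - Weight tying between the embedding layer and the final linear layer, a common way for small models to save parameters
- - Embedding layer optimization: without weight tying, a 10,000-word vocabulary would consume about 102M parameters (0.1B)
-
- **Limitations:**
- - May be limited when handling complex language phenomena because of its small parameter count
- - Prone to overfitting on domain-specific datasets
- - Limited multilingual capabilities
-
- **Advantages:**
- - Runs efficiently on lower-spec hardware
- - Trains faster than larger models
-
- **Training Pipeline:**
- The model has completed the full pipeline of pre-training, SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization). All related training code is included in the repository.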
 