Update README.md
README.md
@@ -7,332 +7,3 @@ language:
- zh
license: apache-2.0
---

<div style="display: flex; flex-direction: column; align-items: center; justify-content: center; text-align: center; font-size: 16px; font-weight: bold; margin-top: 50px;">

<div>
<a href="#english" style="text-decoration: none; margin: 0 10px; color: blue;">English</a> |
<a href="#chinese" style="text-decoration: none; margin: 0 10px; color: blue;">中文</a>
</div>

<h1 style="margin: 20px 0 0 0; font-size: 2.5em; font-weight: bold;">KHAOSZ</h1>
</div>

<h2 id="english">English Version</h2>

This is a Chinese-English bilingual Transformer model. The repository contains the model configuration and the training workflow; training loads the parameters defined in `params/config.json`. The training script `train.py` parses command-line arguments, including the dataset root directory, number of training epochs, batch size, checkpoint interval, and checkpoint directory.

**Model Download Options (Choose One):**

1. Visit [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) and download from **Files and versions**
2. Run `params/download.py` to download the parameters

**Demo Video:** [bilibili](https://www.bilibili.com/video/BV1z5RPYHEkd)

Training dataset sources are listed in the **Model Card** section of the HuggingFace page linked above.

**License:** The code is released under the Apache-2.0 license. Please credit the source when you use it.

- **📊 Device Selection:** The code trains on CUDA by default.
- **🌐 Performance Optimization:** `dtype=torch.bfloat16` is enabled to accelerate training and reduce memory usage. Make sure your hardware supports it.
- **🤖 Language Support:** The model supports training on Chinese and English. The BBPE tokenizer was not trained on text in other languages, so out-of-vocabulary (OOV) issues are minimal for Chinese and English but may occur for other languages.

### 📌 Training Guide

To train this Transformer model, follow these steps:

**(1). Prepare Dataset:**

Place datasets in the designated root directory. The source text can be Chinese, English, or a mix of both. The stored format should match the model's input requirements, preferably pre-tokenized `token_ids` saved as a `torch.Tensor`. A tensor with a compact integer dtype uses less memory than a Python list, which stores each element as a full Python integer object.
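
The memory point above can be illustrated without PyTorch. The sketch below is only an illustration: it uses the standard-library `array` module as a stand-in for a packed tensor (an assumption, so the example runs anywhere), and the helper names `list_bytes` and `packed_bytes` are hypothetical. A `torch.Tensor` with an integer dtype keeps its elements in a comparable contiguous typed buffer.

```python
import sys
from array import array


def list_bytes(token_ids):
    """Approximate footprint of a Python list: the list object itself
    plus one int object per element."""
    return sys.getsizeof(token_ids) + sum(sys.getsizeof(t) for t in token_ids)


def packed_bytes(token_ids, typecode="i"):
    """Payload size of the same ids packed into a typed buffer
    (typecode "i" is a C int, 4 bytes on common platforms)."""
    packed = array(typecode, token_ids)
    return packed.itemsize * len(packed)


if __name__ == "__main__":
    token_ids = list(range(1000, 11000))  # 10,000 made-up token ids
    print("list  :", list_bytes(token_ids), "bytes")
    print("packed:", packed_bytes(token_ids), "bytes")
```

The exact numbers depend on the Python build, so none are quoted here; the packed form should come out several times smaller.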

**(2). Install Dependencies:**

```bash
pip install -r requirements.txt
pip install .
```

**(3). Run Training Script:**

```bash
python train.py \
    --train_type=seq \
    --data_root_path=/path/to/dataset \
    --n_epoch=5 \
    --batch_size=8 \
    --max_lr=2e-4 \
    --n_iter_ckpt=10000 \
    --ckpt_dir=checkpoints
```

**Parameters Explanation:**
- `--train_type`: Training type; one of `seq`, `sft`, or `dpo`
- `--data_root_path`: Dataset root directory
- `--n_epoch`: Total training epochs
- `--batch_size`: Batch size
- `--n_iter_step`: Number of batches per training step
- `--warning_step`: Warmup steps
- `--max_lr`: Maximum learning rate (uses warmup + cosine decay)
- `--n_iter_ckpt`: Checkpoint saving interval
- `--ckpt_dir`: Checkpoint directory
- `--resume_dir`: Path to resume training from checkpoint
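
For readers who want to see how flags like these are typically wired up, here is a minimal `argparse` sketch. It is not the repository's `train.py`: the flag names come from the list above, but the types, defaults, and which flags are required are assumptions, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; the types and
    defaults here are assumptions, not values taken from train.py."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument("--train_type", choices=["seq", "sft", "dpo"], required=True)
    parser.add_argument("--data_root_path", type=str, required=True)
    parser.add_argument("--n_epoch", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--n_iter_step", type=int, default=1)
    parser.add_argument("--warning_step", type=int, default=0)
    parser.add_argument("--max_lr", type=float, default=2e-4)
    parser.add_argument("--n_iter_ckpt", type=int, default=10000)
    parser.add_argument("--ckpt_dir", type=str, default="checkpoints")
    parser.add_argument("--resume_dir", type=str, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--train_type=sft", "--data_root_path=/path/to/dataset", "--n_epoch=5"]
    )
    print(args)
```

With `argparse`, both `--ckpt_dir=checkpoints` and `--ckpt_dir checkpoints` are accepted.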

Training logs are saved to `train_log.txt`. Checkpoints are written to the specified directory and can be used to resume training or to run evaluation.
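
The `--max_lr` entry mentions warmup followed by cosine decay. One common form of that schedule is sketched below. The linear warmup shape, the `min_lr` floor, and the function name `lr_at` are assumptions; the repository's scheduler may differ in its details.

```python
import math


def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup to max_lr,
    then cosine decay down to min_lr at total_steps."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at(s, max_lr=2e-4, warmup_steps=100, total_steps=1000))
```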

### 👉 Usage Guide

**(1). Chatting with the Model:**

See `chat.py`, or use the streaming or non-streaming interface directly:

**Streaming Output:**
```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
history = []

while True:
    query = input(">> ")
    if query == "!exit":
        break

    # Track how much of the response has been printed so far, so each
    # iteration prints only the newly generated text.
    response_size = 0
    for response, history in model.stream_generate(
        query=query,
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    ):
        print(response[response_size:], end="")
        response_size = len(response)
    print()  # end the line before the next prompt
```
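
The slicing pattern in the streaming loop can be tried without a model. The sketch below assumes, as the loop above suggests, that each iteration yields the full response so far together with the updated history. `fake_stream_generate` is a hypothetical stand-in for `model.stream_generate`, and the history format (a list of `(query, response)` pairs) is an assumption.

```python
def fake_stream_generate(query, history):
    """Hypothetical stand-in for model.stream_generate: yields the
    response accumulated so far and an updated copy of the history."""
    reply = f"echo: {query}"
    for end in range(1, len(reply) + 1):
        partial = reply[:end]
        yield partial, history + [(query, partial)]


def collect_deltas(stream):
    """Return only the newly added text from each cumulative response."""
    deltas, seen, history = [], 0, None
    for response, history in stream:
        deltas.append(response[seen:])  # text added since the last step
        seen = len(response)
    return deltas, history


if __name__ == "__main__":
    deltas, history = collect_deltas(fake_stream_generate("hi", []))
    print("".join(deltas))
    print(history)
```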

**Non-streaming Output:**
```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
history = []

while True:
    query = input(">> ")
    if query == "!exit":
        break

    response = model.generate(
        query=query,
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    )
    print(response)
```
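
Both chat examples pass `temperature`, `top_k`, and `top_p`. How Khaosz applies them internally is not shown in this README. The sketch below gives one conventional reading of the three settings in pure Python. `sample_token` is a hypothetical helper that works on a plain list of logits, and the order of the filtering steps is an assumption.

```python
import math
import random


def sample_token(logits, temperature=0.85, top_k=50, top_p=0.95, rng=random):
    """Sample an index from logits using temperature, top-k, and top-p."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most probable indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Draw from the surviving candidates in proportion to their probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    print(sample_token([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

Lower `temperature`, smaller `top_k`, or smaller `top_p` all narrow the set of likely outputs, which fits the more conservative settings used in the RAG example.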

**(2). Retrieval-Augmented Generation (RAG):**

```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)

query = "your question here"  # placeholder; replace with the actual question

retrieved_content = model.retrieve_generate(
    query=query,
    retrieve_top_k=5,
    temperature=0.6,
    top_k=30,
    top_p=0.95
)
print(retrieved_content)
```
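
The README does not describe how `retrieve_generate` selects passages. As a generic illustration of the retrieval step in RAG, the sketch below ranks passages by cosine similarity and builds a prompt from the best `k`. Every name in it (`cosine`, `retrieve_top_k`, `build_prompt`) and the prompt layout are hypothetical, and the vectors are toy values rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec, docs, k=5):
    """docs is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, passages):
    """Prepend the retrieved passages to the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    passages = retrieve_top_k([1.0, 0.0], docs, k=2)
    print(build_prompt("example question", passages))
```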

### 📌 Model Specifications

This model is a 24-layer Transformer whose hyperparameters are defined in `config.json`; it has approximately 1.0 billion (1.0B) parameters.

**Key Design Choices:**
- Weight tying between the embedding layer and the final linear layer (a common way for small models to save parameters)
- Embedding layer optimization: without weight tying, a 10,000-word vocabulary would consume ~102M parameters (0.1B)
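
The saving from weight tying follows from simple arithmetic: an embedding table and an output projection each hold `vocab_size * hidden_dim` parameters, and tying lets them share one matrix. The sketch below uses illustrative values that are assumptions, not read from `config.json`. They were chosen only because their product is close to the ~102M figure quoted above, and the function names are hypothetical.

```python
def matrix_params(vocab_size, hidden_dim):
    """Parameters in one vocab_size x hidden_dim matrix."""
    return vocab_size * hidden_dim


def input_output_params(vocab_size, hidden_dim, tied):
    """Parameters spent on the embedding table plus the output projection.
    With tying, both layers share a single matrix."""
    one = matrix_params(vocab_size, hidden_dim)
    return one if tied else 2 * one


if __name__ == "__main__":
    vocab_size, hidden_dim = 50_000, 2_048  # illustrative assumptions only
    tied = input_output_params(vocab_size, hidden_dim, tied=True)
    untied = input_output_params(vocab_size, hidden_dim, tied=False)
    print(f"tied: {tied:,}  untied: {untied:,}  saved: {untied - tied:,}")
```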

**Limitations:**
- May struggle with complex language phenomena because of its relatively small parameter count
- Prone to overfitting on specialized datasets
- Limited multilingual capabilities

**Advantages:**
- Runs efficiently on lower-spec hardware
- Shorter training time compared to larger models

**Training Pipeline:**
The model has completed pre-training, SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization). All corresponding training code is included in the repository.

<h2 id="chinese">Chinese Version</h2>

This is a Transformer model that supports both Chinese and English. The repository contains the configuration files and the training workflow; training loads the parameters defined in `params/config.json`. The training script `train.py` parses command-line arguments, including the dataset root directory, number of training epochs, batch size, checkpoint interval, and checkpoint directory.

**Model Download Options (Choose One):**

1. Visit [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) and open **Files and versions**
2. Run `params/download.py` to download the model parameters

**Demo Video:** [bilibili](https://www.bilibili.com/video/BV1z5RPYHEkd)

For the training data sources, see the **Model Card** section of the HuggingFace download page.

**License:** The code is released under the Apache-2.0 license. Please credit the source when you use it.

- **📊 Device Selection:** Training uses CUDA by default
- **🌐 Performance Optimization:** `dtype=torch.bfloat16` is enabled to accelerate training and reduce memory usage; make sure your hardware supports it
- **🤖 Language Support:** The model supports training on Chinese and English. Because the BBPE tokenizer was not trained on multilingual text, OOV (out-of-vocabulary) issues are rare for Chinese and English but may occur for other languages

### 📌 Training Guide

To train this Transformer model, follow these steps:

#### **(1). Prepare Dataset:**

Place the dataset in the designated root directory. Files should be text documents containing Chinese, English, or mixed text. The format should match the model's input requirements, preferably pre-tokenized `token_ids` saved as a `torch.Tensor`. A tensor with a compact integer dtype uses less memory than a Python list, which stores each element as a full Python integer object.

#### **(2). Install Dependencies:**

```bash
pip install -r requirements.txt
pip install .
```

#### **(3). Run Training Script:**

```bash
python train.py \
    --train_type=seq \
    --data_root_path=/path/to/dataset \
    --n_epoch=5 \
    --batch_size=8 \
    --max_lr=2e-4 \
    --n_iter_ckpt=10000 \
    --ckpt_dir=checkpoints
```

**Parameters Explanation:**
- `--train_type`: Training type; one of `seq`, `sft`, or `dpo`
- `--data_root_path`: Dataset root directory
- `--n_epoch`: Total training epochs
- `--batch_size`: Batch size
- `--n_iter_step`: Number of batches per training step
- `--warning_step`: Warmup steps
- `--max_lr`: Maximum learning rate (uses warmup + cosine decay)
- `--n_iter_ckpt`: Checkpoint saving interval
- `--ckpt_dir`: Checkpoint save directory
- `--resume_dir`: Path from which to resume training

Training logs are saved to `train_log.txt`. Checkpoints are written to the specified directory and can be used to resume training or to run evaluation.

### 👉 Usage Guide

#### **(1). Chatting with the Model:**

See `chat.py`, or use the streaming or non-streaming interface:

**Streaming Output:**
```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
history = []

while True:
    query = input(">> ")
    if query == "!exit":
        break

    # Track how much of the response has been printed so far, so each
    # iteration prints only the newly generated text.
    response_size = 0
    for response, history in model.stream_generate(
        query=query,
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    ):
        print(response[response_size:], end="")
        response_size = len(response)
    print()  # end the line before the next prompt
```

**Non-streaming Output:**
```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
history = []

while True:
    query = input(">> ")
    if query == "!exit":
        break

    response = model.generate(
        query=query,
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    )
    print(response)
```

#### **(2). Retrieval-Augmented Generation (RAG):**

```python
import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)

query = "your question here"  # placeholder; replace with the actual question

retrieved_content = model.retrieve_generate(
    query=query,
    retrieve_top_k=5,
    temperature=0.6,
    top_k=30,
    top_p=0.95
)
print(retrieved_content)
```

### 📌 Model Specifications (Repeated Section)

This model is based on a 24-layer Transformer architecture whose hyperparameters are defined in `config.json`; it has approximately 1.0 billion (1.0B) parameters.

**Key Design Choices:**
- Weight tying between the embedding layer and the final linear layer, a common way for small models to save parameters
- Embedding layer optimization: without weight tying, a 10,000-word vocabulary would consume ~102M parameters (0.1B)

**Limitations:**
- May be limited when handling complex language phenomena because of its relatively small parameter count
- Prone to overfitting on domain-specific datasets
- Limited multilingual capabilities

**Advantages:**
- Runs efficiently on lower-spec hardware
- Shorter training time compared to larger models

**Training Pipeline:**
The model has completed the full pipeline of pre-training, SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization). All related training code is included in the repository.