# komt : Korean Multi-task Instruction Tuning

Following the recent success of ChatGPT, numerous large language models have emerged in an attempt to catch up with its capabilities.
However, when it comes to Korean, many of these models still struggle to provide accurate answers or to generate fluent Korean text.
This study addresses these challenges by introducing a multi-task instruction technique that leverages supervised datasets from various tasks to create training data for large language models (LLMs).
## News or Update
### 2023.12.05
- Released the DPO training code: [dpo_train.py](dpo_train.py)
### 2023.11.29
- Added komt-mistral-7b-v1-dpo, a model trained with DPO (Direct Preference Optimization)
> - [davidkim205/komt-mistral-7b-v1-dpo](https://huggingface.co/davidkim205/komt-mistral-7b-v1-dpo/blob/main/README.md)
- komt-mistral-7b-v1-dpo scored 76.75% in evaluation, the highest among the komt models so far (gpt-3.5-turbo: 79.45%)
### 2023.10.24
- Added the komt-mistral-7b-v1 model
> - [davidkim205/komt-mistral-7b-v1](https://huggingface.co/davidkim205/komt-mistral-7b-v1)
> - [davidkim205/komt-mistral-7b-v1-lora](https://huggingface.co/davidkim205/komt-mistral-7b-v1-lora)
> - [davidkim205/komt-mistral-7b-v1-gguf](https://huggingface.co/davidkim205/komt-mistral-7b-v1-gguf)
### 2023.10.20
- Added the komt-llama-30b-v1 model
> - [davidkim205/komt-llama-30b-v1](https://huggingface.co/davidkim205/komt-llama-30b-v1)
> - [davidkim205/komt-llama-30b-v1-lora](https://huggingface.co/davidkim205/komt-llama-30b-v1-lora)
### 2023.09.27
- Added the following models to the ChatGPT-based evaluation results:
> - naver Cue
> - clova X
> - nlpai-lab/kullm-polyglot-12.8b-v2
> - kfkas/Llama-2-ko-7b-Chat
> - beomi/KoAlpaca-Polyglot-12.8B
### 2023.09.25
- Added the komt-llama2-13b-v1 model
> - [davidkim205/komt-llama2-13b-v1](https://huggingface.co/davidkim205/komt-llama2-13b-v1)
> - [davidkim205/komt-llama2-13b-v1-lora](https://huggingface.co/davidkim205/komt-llama2-13b-v1-lora)
> - [davidkim205/komt-llama2-13b-v1-ggml](https://huggingface.co/davidkim205/komt-llama2-13b-v1-ggml)
### 2023.09.24
- Added instructions for fine-tuning with DeepSpeed
### 2023.09.23
- Added code and setup instructions for using komt with vLLM
### 2023.09.22
- Added the model evaluation results table
### 2023.09.20
- Added an option to finetune_with_lora so training can use 4-bit or 8-bit quantization
### 2023.09.19
- Added examples, training instructions, and datasets to make the komt-llama2 models easier to use.
### 2023.09.17
- Released the komt-llama2-7b-v1 model trained on an improved multi-task dataset (fixes issues such as the end token occasionally not being applied and answers being too long).
- [davidkim205/komt-llama2-7b-v1](https://huggingface.co/davidkim205/komt-llama2-7b-v1)
- [davidkim205/komt-llama2-7b-v1-lora](https://huggingface.co/davidkim205/komt-llama2-7b-v1-lora)
- [davidkim205/komt-llama2-7b-v1-ggml](https://huggingface.co/davidkim205/komt-llama2-7b-v1-ggml)
### 2023.08.17
- We are releasing the [davidkim205/komt-Llama-2-13b-hf-lora](https://huggingface.co/davidkim205/komt-Llama-2-13b-hf-lora) and [davidkim205/komt-Llama-2-13b-hf-ggml](https://huggingface.co/davidkim205/komt-Llama-2-13b-hf-ggml) models
### 2023.08.16
- We are releasing the [davidkim205/komt-Llama-2-7b-chat-hf-ggml](https://huggingface.co/davidkim205/komt-Llama-2-7b-chat-hf-ggml) model
## Released Model Checkpoints
### komt-llama2-7b
- [davidkim205/komt-llama2-7b-v1](https://huggingface.co/davidkim205/komt-llama2-7b-v1)
- [davidkim205/komt-llama2-7b-v1-lora](https://huggingface.co/davidkim205/komt-llama2-7b-v1-lora)
- [davidkim205/komt-llama2-7b-v1-ggml](https://huggingface.co/davidkim205/komt-llama2-7b-v1-ggml)
### komt-llama2-13b
- [davidkim205/komt-llama2-13b-v1](https://huggingface.co/davidkim205/komt-llama2-13b-v1)
- [davidkim205/komt-llama2-13b-v1-lora](https://huggingface.co/davidkim205/komt-llama2-13b-v1-lora)
- [davidkim205/komt-llama2-13b-v1-ggml](https://huggingface.co/davidkim205/komt-llama2-13b-v1-ggml)
### komt-llama-30b
- [davidkim205/komt-llama-30b-v1](https://huggingface.co/davidkim205/komt-llama-30b-v1)
- [davidkim205/komt-llama-30b-v1-lora](https://huggingface.co/davidkim205/komt-llama-30b-v1-lora)
### komt-mistral-7b
- [davidkim205/komt-mistral-7b-v1](https://huggingface.co/davidkim205/komt-mistral-7b-v1)
- [davidkim205/komt-mistral-7b-v1-lora](https://huggingface.co/davidkim205/komt-mistral-7b-v1-lora)
- [davidkim205/komt-mistral-7b-v1-gguf](https://huggingface.co/davidkim205/komt-mistral-7b-v1-gguf)
- [davidkim205/komt-mistral-7b-v1-dpo](https://huggingface.co/davidkim205/komt-mistral-7b-v1-dpo)
## Hardware and Software
- NVIDIA driver: 535.54.03
- CUDA version: 12.2
## Setup
```
git clone https://github.com/davidkim205/komt.git
cd komt
conda create -n komt python=3.10
conda activate komt
pip install -r requirements.txt
```
## Usage
We provide several ways to use the komt-llama2 models.
### transformers
```
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import TextStreamer, GenerationConfig
model_name='davidkim205/komt-llama2-7b-v1'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)
def gen(x):
    generation_config = GenerationConfig(
        temperature=0.8,
        top_p=0.8,
        top_k=100,
        max_new_tokens=512,
        early_stopping=True,
        do_sample=True,
    )
    q = f"### instruction: {x}\n\n### Response: "
    gened = model.generate(
        **tokenizer(
            q,
            return_tensors='pt',
            return_token_type_ids=False
        ).to('cuda'),
        generation_config=generation_config,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )
    result_str = tokenizer.decode(gened[0])
    start_tag = f"\n\n### Response: "
    start_index = result_str.find(start_tag)
    if start_index != -1:
        result_str = result_str[start_index + len(start_tag):].strip()
    return result_str

# "Plan a one-night, two-day solo trip itinerary for Jeju Island"
print(gen('제주도를 1박2일로 혼자 여행하려고 하는데 여행 코스를 만들어줘'))
```
Result (translated from the Korean output)
```
### Response: For a one-night, two-day solo trip to Jeju Island, you can plan an itinerary like the following:

Day 1:
- Morning: Head to one of Jeju's beautiful beaches. Watch the sunrise and take in the natural scenery.
- Afternoon: Explore Hallasan, Jeju's most famous attraction. Follow the hiking trail, enjoy the views, and take an easy walk.
- Evening: Have dinner at one of Jeju's good restaurants. Tasting dishes made with fresh seafood and local seasonings is a perfect part of a Jeju trip.

Day 2:
- Morning: Move on to the Hallasan area to explore it further. It is a great choice for people who enjoy hiking.
```
### text-generation-webui

```
# clone the text-generation-webui code
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui/
# create a conda environment
conda create -n text-generation-webui python=3.10
conda activate text-generation-webui
# pip install
pip install -r requirements.txt
# model download
pip install huggingface-hub
python -c "from huggingface_hub import hf_hub_download;print(hf_hub_download(repo_id='davidkim205/komt-llama2-7b-v1-ggml', filename='ggml-model-q4_0.gguf', local_dir='./models/'))"
# run the server
python server.py
```
### llama2-webui

https://github.com/liltom-eth/llama2-webui
Clone llama2-webui with git and install its requirements. Then, since the model files are large, download komt-llama2-7b using git lfs.
```
git clone https://github.com/liltom-eth/llama2-webui.git
cd llama2-webui
pip install -r requirements.txt
```
After downloading the model, run the app.
```
sudo apt install git-lfs
git lfs clone https://huggingface.co/davidkim205/komt-llama2-7b-v1
python app.py --backend_type transformers --model_path ./komt-llama2-7b-v1/
```
### llama.cpp

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
pip install huggingface-hub
python -c "from huggingface_hub import hf_hub_download;print(hf_hub_download(repo_id='davidkim205/komt-llama2-7b-v1-ggml', filename='ggml-model-q4_0.gguf', local_dir='./models/'))"
make -j && ./main -m ./models/ggml-model-q4_0.gguf -p "인삼은 어떤 효과가 있는가요? ##output:"
```
### llama.cpp with google colab
How to use komt with llama.cpp on Google Colab:
https://colab.research.google.com/drive/1uLHXv-6NT7yj4FHECrZezfo5pVL-ht63?usp=sharing
### usage_komt_with_lora
Examples using Python and Jupyter:
- [usage_komt_with_lora.py](usage_komt_with_lora.py)
- [usage_komt_with_lora.ipynb](usage_komt_with_lora.ipynb)
```
$ python infer.py
Downloading (…)/adapter_config.json: 100%|██████████| 528/528 [00:00<00:00, 5.02MB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 631/631 [00:00<00:00, 4.96MB/s]
Downloading pytorch_model.bin: 100%|██████████| 27.0G/27.0G [04:29<00:00, 100MB/s]
Downloading (…)neration_config.json: 100%|██████████| 183/183 [00:00<00:00, 1.36MB/s]
Downloading adapter_model.bin: 100%|██████████| 80.1M/80.1M [00:00<00:00, 82.7MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 749/749 [00:00<00:00, 6.66MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 111MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 21.0/21.0 [00:00<00:00, 131kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 96.0/96.0 [00:00<00:00, 608kB/s]
/home/david/anaconda3/envs/komt/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:399: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/home/david/anaconda3/envs/komt/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:399: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`.
warnings.warn(
<s> ### instruction: Why do cats dislike water? (translated from the Korean output)
### Response: Unlike people, cats dislike water. This is because of what is dissolved in the water and the water's smell. Cats do not want to drink water with things dissolved in it, and they are sensitive to its smell. For these reasons, cats have come to dislike water.
Unlike people, cats have a high body temperature and need many calories to maintain it. Therefore, cats do not drink much water and dislike it. Cats do not take in water to maintain their body temperature and do not want to drink it.
Also, cats dislike water because drinking it chills their stomach and because of what is dissolved in it. Since it is dissolved in the water
Unlike people, cats dislike water. This is because of what is dissolved in the water and the water's smell. Cats do not want to drink water with things dissolved in it, and they are sensitive to its smell. For these reasons, cats have come to dislike water.
Unlike people, cats have a high body temperature and need many calories to maintain it. Therefore, cats do not drink much water and dislike it. Cats do not take in water to maintain their body temperature and do not want to drink it.
```
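The scripts above follow the usual PEFT pattern: load the base model, then attach the published LoRA adapter on top of it. The snippet below is a minimal sketch of that pattern (illustrative, not the repository's exact code), assuming the `peft` library:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "davidkim205/komt-llama2-7b-v1"          # base model
adapter = "davidkim205/komt-llama2-7b-v1-lora"  # LoRA adapter weights

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)  # attach the adapter to the frozen base
model.eval()

# "Why do cats dislike water?" -- same question as in the sample output above
prompt = "### instruction: 고양이는 왜 물을 싫어하나요?\n\n### Response: "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```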
### usage komt with vllm

To use the vLLM library, create a conda environment as shown below and install the packages from requirements_vllm.txt.
```
conda create -n vllm python=3.10
conda activate vllm
pip install -r requirements_vllm.txt
```
Run the example code as shown below and then enter a question.
```
$ python usage_komt_with_vllm.py
INFO 09-25 18:48:20 llm_engine.py:72] Initializing an LLM engine with config: model='davidkim205/komt-llama2-7b-v1', tokenizer='davidkim205/komt-llama2-7b-v1', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, download_dir=None, load_format=auto, tensor_parallel_size=1, seed=0)
INFO 09-25 18:48:20 tokenizer.py:30] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 09-25 18:48:36 llm_engine.py:199] # GPU blocks: 1048, # CPU blocks: 512
>Recommend a date course on Jeju Island
Processed prompts: 100%|██████████| 1/1 [00:15<00:00, 15.30s/it]
Prompt: '### instruction: Recommend a date course on Jeju Island\n\n### Response: ', Generated text (translated from the Korean output): 'Here is a Jeju Island date course.\n1. Get up early and greet the morning while watching the sunrise at a park in Jeju City.\n2. Stroll around the park and take in the natural scenery, crossing a waterfall bridge and enjoying the views along the way.\n3. Around 1 p.m., visit the area near Seongsan Ilchulbong, where you can enjoy experiences such as singing rooms, performances, concerts, and a Hallasan discovery walk.\n4. If you want to see Jeju's various seafood and local specialties, visit a traditional market in Jeju City, where you can also taste Jeju tangerines.\n5. Finally, in the evening you can watch the sunrise from Seongsan Ilchulbong and appreciate its beauty.\n\nAre you now ready to enjoy the charm of Jeju? Enjoy a Jeju Island date course where you can escape busy daily life and feel at ease.'
```
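For reference, the core of the vLLM example can be sketched roughly as follows (illustrative, not the exact contents of usage_komt_with_vllm.py), using vLLM's `LLM` and `SamplingParams` API with the same prompt template as above:
```
from vllm import LLM, SamplingParams

# Load the model once; vLLM handles batching and KV-cache management internally
llm = LLM(model="davidkim205/komt-llama2-7b-v1")
sampling_params = SamplingParams(temperature=0.8, top_p=0.8, max_tokens=512)

def gen(instruction: str) -> str:
    # Same "### instruction / ### Response" template used throughout this README
    prompt = f"### instruction: {instruction}\n\n### Response: "
    outputs = llm.generate([prompt], sampling_params)
    return outputs[0].outputs[0].text

# "Recommend a date course on Jeju Island"
print(gen("제주도 데이트 코스 알려줘"))
```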
## Fine-tune
We provide instructions for training the komt-llama2 models.
Among the datasets used for the paper and the released models, the KorQuAD 1.0 dataset, which is freely licensed, has been added to the datasets directory.
For details about the paper, see the Korean Multi-task Instruction Tuning section below.
### Fine-tune with lora

First, clone the code from GitHub and install the packages (see Setup above).
finetune_with_lora.py trains the model on a custom dataset.
By default, when run without arguments as below, training uses davidkim205/komt-llama2-7b-v1 as the base model and [komt_squad.json](datasets%2Fkomt_squad.json) as the dataset.
```
python finetune_with_lora.py
```
The model, dataset, batch size, and other options can be changed as follows:
```
python finetune_with_lora.py --model_name_or_path davidkim205/komt-llama2-7b-v1 --data_path datasets/komt_squad.json --num_train_epochs 1 --per_device_train_batch_size 1 --learning_rate 1e-5
```
For a detailed description of all arguments, run `python finetune_with_lora.py -h`.
#### Fine-tune 8-bit models with Low-Rank Adaptation (LoRA)
finetune_with_lora.py quantizes the model to 4-bit by default for training.
To quantize to 8-bit instead, use:
```
python finetune_with_lora.py --bits 8
```
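The `--bits` option controls how the frozen base model is quantized before the LoRA adapters are trained on top of it. A minimal sketch of that idea using transformers' `BitsAndBytesConfig` (illustrative, not the repository's exact code):
```
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_quantized_base(model_name: str, bits: int = 4):
    # Quantize the frozen base weights to 4-bit (NF4) or 8-bit for LoRA training
    if bits == 4:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    elif bits == 8:
        bnb_config = BitsAndBytesConfig(load_in_8bit=True)
    else:
        raise ValueError("bits must be 4 or 8")
    return AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )

model = load_quantized_base("davidkim205/komt-llama2-7b-v1", bits=8)
```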
### Fine-tune with deepspeed
finetune_with_ds.py trains with DeepSpeed using ZeRO-3 Offload.
CPU offloading reduces GPU memory usage, but since it uses CPU memory instead, adjust the settings to match your hardware.
The DeepSpeed configuration file is provided at configs/deepspeed_config.json.
To use DeepSpeed, create a conda environment as below and install the required packages.
```
conda create -n ds python=3.10
conda activate ds
pip install -r requirements_ds.txt
```
Fine-tuning with DeepSpeed is run as follows:
```
deepspeed finetune_with_ds.py
```
To change the arguments, refer to the example below:
```
deepspeed finetune_with_ds.py --model_name_or_path davidkim205/komt-llama2-7b-v1 --data_path datasets/komt_squad.json --num_train_epochs 1 --per_device_train_batch_size 1 --learning_rate 1e-5 --deepspeed configs/deepspeed_config.json
```
### Fine-tune with Direct Preference Optimization (DPO)
We release the training code and a model trained with Direct Preference Optimization (DPO) for use in production services.
DPO requires an SFT model as a starting point, so we used the already-trained komt model; the resulting model improves on it by about 5% and gives consistent answers to the same question.
For the Korean dataset we used maywell/ko_Ultrafeedback_binarized.
To run dpo_train.py, install the packages from requirements_dpo.txt as follows:
```
conda create -n dpo_train python=3.10
conda activate dpo_train
pip install -r requirements_dpo.txt
```
After installation, configure accelerate with `accelerate config`.
```
accelerate config
```
Then run DPO training via accelerate launch.
```
accelerate launch dpo_train.py
```
On a single A100, training takes about 9 hours.
```
warnings.warn(
0%| | 1/1000 [00:36<10:13:09, 36.83s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (1069 > 1024). Running this sequence through the model will result in indexing errors
{'loss': 0.6961, 'learning_rate': 5e-05, 'rewards/chosen': 0.004012207966297865, 'rewards/rejected': 0.007965649478137493, 'rewards/accuracies': 0.515625, 'rewards/margins': -0.003953440580517054, 'logps/rejected': -222.7124481201172, 'logps/chosen': -259.6094665527344, 'logits/rejected': -2.6427276134490967, 'logits/chosen': -2.6100172996520996, 'epoch': 0.01}
2%|▏ | 17/1000 [09:31<8:50:11, 32.36s/it]
```
For details on DPO, see https://arxiv.org/abs/2305.18290.
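As a rough illustration of what dpo_train.py optimizes, the DPO objective from the paper can be written as a loss over per-sequence log-probabilities of the chosen and rejected answers under the policy and the frozen reference (SFT) model. This is a sketch of the formula, not the repository's exact code; the implicit rewards correspond to the `rewards/chosen` and `rewards/rejected` values in the training log above:
```
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log pi(y_w|x) - log pi_ref(y_w|x) and log pi(y_l|x) - log pi_ref(y_l|x)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
    # Implicit rewards as logged during training
    rewards_chosen = beta * chosen_logratio.detach()
    rewards_rejected = beta * rejected_logratio.detach()
    return loss, rewards_chosen, rewards_rejected
```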
## Evaluation Results
We evaluated questions and answers using ChatGPT as follows. For the evaluation questions and answers and ChatGPT's judgments, see eval_results.
| model | score | average(0~5) | percentage |
|------------------------------------------|---------| ------------ |------------|
| gpt-3.5-turbo(close) | 147 | 3.97 | 79.45% |
| naver Cue(close) | 140 | 3.78 | 75.67% |
| clova X(close) | 136 | 3.67 | 73.51% |
| WizardLM-13B-V1.2(open) | 96 | 2.59 | 51.89% |
| Llama-2-7b-chat-hf(open) | 67 | 1.81 | 36.21% |
| Llama-2-13b-chat-hf(open) | 73 | 1.91 | 38.37% |
| nlpai-lab/kullm-polyglot-12.8b-v2(open) | 70 | 1.89 | 37.83% |
| kfkas/Llama-2-ko-7b-Chat(open) | 96 | 2.59 | 51.89% |
| beomi/KoAlpaca-Polyglot-12.8B(open) | 100 | 2.70 | 54.05% |
| **komt-llama2-7b-v1 (open)(ours)** | **117** | **3.16** | **63.24%** |
| **komt-llama2-13b-v1 (open)(ours)** | **129** | **3.48** | **69.72%** |
| **komt-llama-30b-v1 (open)(ours)** | **129** | **3.16** | **63.24%** |
| **komt-mistral-7b-v1 (open)(ours)** | **131** | **3.54** | **70.81%** |
| **komt-mistral-7b-v1-dpo (open)(ours)** | **142** | **3.83** | **76.75%** |
----
# Korean Multi-task Instruction Tuning
## Abstract
With the recent success of ChatGPT, numerous large language models have emerged in an attempt to catch up with ChatGPT's capabilities. However, it has become evident that these models still struggle to provide accurate responses in Korean or face challenges when generating Korean text. In this study, we introduce the multi-task instruction technique, which is based on supervised datasets from various tasks, to create training data for large language models, aiming to address these issues.
## Introduction
Recent Korean large language models have predominantly relied on translated datasets, such as those from GPT-4-LLM, Dolly, and Vicuna. However, using translated datasets presents several challenges:
- Language and Cultural Differences
Languages and cultures have unique expressions, vocabularies, and grammatical structures. Using translated datasets can hinder the model's ability to understand and learn effectively due to these differences.
- Translation Errors and Semantic Distortions
Machine translations are not perfect and can introduce errors or distort the meaning of the original text. This can lead to the model learning incorrect information or failing to grasp the true meaning of the source data.
- Data Quality
The quality of translated data depends on the accuracy of the source data. If the source data is inaccurate or noisy, the translated data can suffer from the same issues.
- Word Embedding Consistency
Mapping words from different languages into a consistent embedding space can be challenging. This can result in the model failing to learn the correct relationships between words or failing to recognize semantic differences among translated words.
- Data Quantity and Diversity
Using translated foreign datasets may not provide sufficient quantity and diversity of data, depending on the language and topic domain. Obtaining the required data quantity and diversity can be challenging.
- Difficulty in Understanding Context
Translated data often fails to convey the original context accurately, making it difficult for the model to understand the real meaning and context of specific words or sentences.
- Specialized Terminology and Idiomatic Expressions
Specialized terminology and idiomatic expressions in specific fields may not be appropriately handled during translation, causing the model to perform poorly in certain subjects or domains.
- Data Bias
Translating data from various countries and cultures can introduce biases or cultural differences into the model, potentially increasing bias in the model's responses.
- Performance Degradation
When original data is translated, some information may be lost in the translation process, leading to a potential decrease in the model's performance compared to using the original data directly.
## 2. Multi-task Instruction
To address these challenges and improve dataset quality, we propose an Instruction Tuning Framework (ITF) that leverages multi-task datasets and instruction tuning, inspired by Google's FLAN (Finetuned Language Models Are Zero-Shot Learners) technique.
### 2.1. Multi-task Datasets
We have curated multi-task datasets based on various existing Korean datasets, specifically tailored to each task. We have avoided relying on translated datasets used in previous Korean large language models. Our dataset sources include:
- AIHub Dataset: 305,900 samples
- KISTI AI Dataset: 824,337 samples
- KorQuad Dataset: 66,181 samples
- Miscellaneous Datasets: 346,803 samples
- Total Dataset Size: 1,543,221 samples
### 2.2. Instruction Tuning
Our ITF incorporates the instruction tuning technique proposed by Google's FLAN, resulting in improved zero-shot performance.
We have publicly released the freely licensed KorQuad 1.0 dataset on GitHub. However, due to licensing policies, we cannot release the other datasets.
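As an illustration, a single QA record in the KorQuAD style can be converted into the same "### instruction / ### Response" prompt template used by the inference code in this README; the field names below are illustrative rather than the exact schema of komt_squad.json:
```
def to_instruction_sample(context: str, question: str, answer: str) -> dict:
    # Wrap a QA pair in the instruction/response template used throughout this project
    prompt = f"### instruction: {context}\n{question}\n\n### Response: "
    return {"text": prompt + answer}

sample = to_instruction_sample(
    context="제주도는 대한민국의 섬이다.",    # "Jeju Island is an island of South Korea."
    question="제주도는 어느 나라의 섬인가?",  # "Which country does Jeju Island belong to?"
    answer="대한민국",                         # "South Korea"
)
print(sample["text"])
```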
## 3. Evaluation
For objective model evaluation, we initially used EleutherAI's lm-evaluation-harness but obtained unsatisfactory results. Consequently, we conducted evaluations using ChatGPT, a widely used model, as described in [Self-Alignment with Instruction Backtranslation](https://arxiv.org/pdf/2308.06259.pdf) and [Three Ways of Using Large Language Models to Evaluate Chat](https://arxiv.org/pdf/2308.06502.pdf).
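A minimal sketch of this ChatGPT-based scoring, assuming the OpenAI Python client (the prompt wording is illustrative; the 0~5 scale matches the table below):
```
from openai import OpenAI

client = OpenAI()  # requires the OPENAI_API_KEY environment variable

def judge(question: str, answer: str) -> str:
    # Ask ChatGPT to grade a model answer on a 0-5 scale
    prompt = (
        "Rate the following answer to the question on a scale of 0 to 5, "
        "considering correctness, helpfulness, and fluency in Korean. "
        "Reply with the score and a one-sentence justification.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```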
| model | score | average(0~5) | percentage |
| --------------------------------------- |---------| ------------ | ---------- |
| gpt-3.5-turbo(close) | 147 | 3.97 | 79.45% |
| naver Cue(close) | 140 | 3.78 | 75.67% |
| clova X(close) | 136 | 3.67 | 73.51% |
| WizardLM-13B-V1.2(open) | 96 | 2.59 | 51.89% |
| Llama-2-7b-chat-hf(open) | 67 | 1.81 | 36.21% |
| Llama-2-13b-chat-hf(open) | 73 | 1.91 | 38.37% |
| nlpai-lab/kullm-polyglot-12.8b-v2(open) | 70 | 1.89 | 37.83% |
| kfkas/Llama-2-ko-7b-Chat(open) | 96 | 2.59 | 51.89% |
| beomi/KoAlpaca-Polyglot-12.8B(open) | 100 | 2.70 | 54.05% |
| **komt-llama2-7b-v1 (open)(ours)** | **117** | **3.16** | **63.24%** |
| **komt-llama2-13b-v1 (open)(ours)** | **129** | **3.48** | **69.72%** |
| **komt-llama-30b-v1 (open)(ours)** | **129** | **3.16** | **63.24%** |
| **komt-mistral-7b-v1 (open)(ours)** | **131** | **3.54** | **70.81%** |
## 4. Conclusion
In this study, we have proposed a method to optimize the Llama2 model for the Korean language. Experimental results demonstrate that the model trained with multi-task instruction outperforms other Korean-supporting Llama2 models.
In future research, we plan to leverage multi-task instruction to develop various service models and applications.
---
# References
### Llama 2
https://github.com/facebookresearch/llama
### Llama 1
https://github.com/facebookresearch/llama/tree/llama_v1
### llama.cpp
https://github.com/ggerganov/llama.cpp
|