Yi Ko 34B Instruct

Training Process

  1. Further trained with Korean corpus.
  2. SFT
  3. DPO (Dataset URL)

Model Info

Context Length Parameter Prompt Template KMMLU(5-shot)
4k(4096) 34B ChatML 49.03

Acknowledgement

The training is supported by Sionic AI.

Original Model Card by beomi

Yi-Ko series models serve as advanced iterations of 01-ai/Yi models, benefiting from an expanded vocabulary and the inclusion of Korean/English corpus in its further pretraining. Just like its predecessor, Yi-Ko series models operate within the broad range of generative text models that stretch from 6 billion to 34 billion parameters. This repository focuses on the 34B pretrained version, which is tailored to fit the Hugging Face Transformers format. For access to the other models, feel free to consult the index provided below.

Model Details

Model Developers Junbum Lee (Beomi)

Variations Yi-Ko-34B will come in a range of parameter sizes โ€” 6B and 34B โ€” with Ko(Korean+English).

Input Models input text only.

Output Models generate text only.

Model Architecture

Yi-Ko series models are an auto-regressive language model that uses an optimized transformer architecture based on Llama-2*.

*Yi model architecture is based on Llama2, so it can be loaded via LlamaForCausalLM class on HF.

Model Name Training Data Params Context Length GQA Trained Tokens LR Train tokens (per batch)
Yi-Ko-34B A mix of Korean + English online data 34B 4k O 40B+ 5e-5 4M

Vocab Expansion

Model Name Vocabulary Size Description
Original Yi-Series 64000 Sentencepiece BPE
Expanded Yi-Ko Series 78464 Sentencepiece BPE. Added Korean vocab and merges

Tokenizing "์•ˆ๋…•ํ•˜์„ธ์š”, ์˜ค๋Š˜์€ ๋‚ ์”จ๊ฐ€ ์ข‹๋„ค์š”.ใ…Žใ…Ž"

Model # of tokens Tokens
Original Yi-Series 47 ['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', 'ํ•˜', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', 'โ–', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '์€', 'โ–', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '๊ฐ€', 'โ–', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']
Expanded Yi-Ko Series 10 ['โ–์•ˆ๋…•', 'ํ•˜์„ธ์š”', ',', 'โ–์˜ค๋Š˜์€', 'โ–๋‚ ', '์”จ๊ฐ€', 'โ–์ข‹๋„ค์š”', '.', 'ใ…Ž', 'ใ…Ž']
*Equal Korean vocab with Llama-2-Ko Series

Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"

Model # of tokens Tokens
Original Yi-Series 21 ['The', 'โ–Y', 'i', 'โ–series', 'โ–models', 'โ–are', 'โ–large', 'โ–language', 'โ–models', 'โ–trained', 'โ–from', 'โ–scratch', 'โ–by', 'โ–developers', 'โ–at', 'โ–', '0', '1', '.', 'AI', '.']
Expanded Yi-Ko Series 21 ['โ–The', 'โ–Y', 'i', 'โ–series', 'โ–models', 'โ–are', 'โ–large', 'โ–language', 'โ–models', 'โ–trained', 'โ–from', 'โ–scratch', 'โ–by', 'โ–developers', 'โ–at', 'โ–', '0', '1', '.', 'AI', '.']
*Equal Korean vocab with Llama-2-Ko Series *Since Expanded Yi-Ko Series prepends _ at the beginning of the text(to ensure same tokenization for Korean sentences), it shows negilible difference for the first token on English tokenization.

Model Benchmark

LM Eval Harness - Korean Benchmarks

Tasks Version Filter n-shot Metric Value Stderr
kmmlu_direct N/A none 5 exact_match 0.5027 ยฑ 0.1019
kobest_boolq 1 none 5 acc 0.9202 ยฑ 0.0072
none 5 f1 0.9202 ยฑ N/A
kobest_copa 1 none 5 acc 0.8480 ยฑ 0.0114
none 5 f1 0.8479 ยฑ N/A
kobest_hellaswag 1 none 5 acc 0.5320 ยฑ 0.0223
none 5 f1 0.5281 ยฑ N/A
none 5 acc_norm 0.6340 ยฑ 0.0216
kobest_sentineg 1 none 5 acc 0.9874 ยฑ 0.0056
none 5 f1 0.9874 ยฑ N/A
haerae N/A none 5 acc 0.7965 ยฑ 0.0116
none 5 acc_norm 0.7965 ยฑ 0.0116
- haerae_general_knowledge 1 none 5 acc 0.5114 ยฑ 0.0378
none 5 acc_norm 0.5114 ยฑ 0.0378
- haerae_history 1 none 5 acc 0.8511 ยฑ 0.0260
none 5 acc_norm 0.8511 ยฑ 0.0260
- haerae_loan_word 1 none 5 acc 0.8402 ยฑ 0.0283
none 5 acc_norm 0.8402 ยฑ 0.0283
- haerae_rare_word 1 none 5 acc 0.8642 ยฑ 0.0170
none 5 acc_norm 0.8642 ยฑ 0.0170
- haerae_standard_nomenclature 1 none 5 acc 0.8301 ยฑ 0.0305
none 5 acc_norm 0.8301 ยฑ 0.0305

LICENSE

Follows Yi License

Citation

Acknowledgement

The training is supported by TPU Research Cloud program.

Downloads last month
18
Safetensors
Model size
35B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for maywell/Yi-Ko-34B-Instruct

Base model

beomi/Yi-Ko-34B
Finetuned
(1)
this model
Quantizations
1 model

Space using maywell/Yi-Ko-34B-Instruct 1