| | --- |
| | license: other |
| | license_name: yi-license |
| | license_link: LICENSE |
| | language: |
| | - en |
| | - ko |
| | pipeline_tag: text-generation |
| | inference: false |
| | base_model: beomi/Yi-Ko-34B |
| | tags: |
| | - pytorch |
| | - Yi-Ko |
| | - 01-ai |
| | - Yi |
| | library_name: transformers |
| | --- |
| | # Yi Ko 34B Instruct |
| |
|
| | ## Training Process |
| |
|
| | 1. Further trained with Korean corpus. |
| | 2. SFT |
| | 3. DPO [(Dataset URL)](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) |
| |
|
| | ## Model Info |
| |
|
| | | Context Length | Parameter | Prompt Template | KMMLU(5-shot) | |
| | | --- | --- | --- | --- | |
| | | 4k(4096) | 34B | ChatML | 49.03 | |
| |
|
| | ## Acknowledgement |
| |
|
| | The training is supported by [Sionic AI](https://sionic.ai). |
| |
|
| | # Original Model Card by [beomi](https://huggingface.co/beomi) |
| |
|
| | Yi-Ko series models serve as advanced iterations of 01-ai/Yi models, |
| | benefiting from an expanded vocabulary and the inclusion of Korean/English corpus in its further pretraining. |
| | Just like its predecessor, Yi-Ko series models operate within the broad range of generative text models that stretch from 6 billion to 34 billion parameters. |
| | This repository focuses on the **34B** pretrained version, |
| | which is tailored to fit the Hugging Face Transformers format. |
| | For access to the other models, feel free to consult the index provided below. |
| |
|
| | ## Model Details |
| |
|
| | **Model Developers** Junbum Lee (Beomi) |
| |
|
| | **Variations** Yi-Ko-34B will come in a range of parameter sizes — 6B and 34B — with Ko(Korean+English). |
| |
|
| | **Input** Models input text only. |
| |
|
| | **Output** Models generate text only. |
| |
|
| | **Model Architecture** |
| |
|
| | Yi-Ko series models are an auto-regressive language model that uses an optimized transformer architecture based on Llama-2*. |
| | |
| | <small>*Yi model architecture is based on Llama2, so it can be loaded via `LlamaForCausalLM` class on HF.</small> |
| |
|
| | |Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Train tokens (per batch)| |
| | |---|---|---|---|---|---|---|---| |
| | |Yi-Ko-34B|*A mix of Korean + English online data*|34B|4k|O|40B+|5e<sup>-5</sup>|4M| |
| |
|
| | **Vocab Expansion** |
| |
|
| | | Model Name | Vocabulary Size | Description | |
| | | --- | --- | --- | |
| | | Original Yi-Series | 64000 | Sentencepiece BPE | |
| | | **Expanded Yi-Ko Series** | 78464 | Sentencepiece BPE. Added Korean vocab and merges | |
| |
|
| | **Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"** |
| |
|
| | | Model | # of tokens | Tokens | |
| | | --- | --- | --- | |
| | | Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` | |
| | | **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` | |
| | |<small>*Equal Korean vocab with Llama-2-Ko Series</small>|| |
| | |
| | **Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"** |
| | |
| | | Model | # of tokens | Tokens | |
| | | --- | --- | --- | |
| | | Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` | |
| | | **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` | |
| | |<small>*Equal Korean vocab with Llama-2-Ko Series</small>| | <small>*Since **Expanded Yi-Ko Series** prepends `_` at the beginning of the text(to ensure same tokenization for Korean sentences), it shows negilible difference for the first token on English tokenization. </small>| |
| | |
| | # **Model Benchmark** |
| | |
| | ## LM Eval Harness - Korean Benchmarks |
| | |
| | | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr| |
| | |----------------|------:|------|-----:|--------|-----:|---|------| |
| | |**kmmlu_direct**|N/A |none | 5|exact_match|**0.5027**|± |0.1019| |
| | |kobest_boolq | 1|none | 5|acc |0.9202|± |0.0072| |
| | | | |none | 5|f1 |0.9202|± |N/A | |
| | |kobest_copa | 1|none | 5|acc |0.8480|± |0.0114| |
| | | | |none | 5|f1 |0.8479|± |N/A | |
| | |kobest_hellaswag| 1|none | 5|acc |0.5320|± |0.0223| |
| | | | |none | 5|f1 |0.5281|± |N/A | |
| | | | |none | 5|acc_norm|0.6340|± |0.0216| |
| | |kobest_sentineg | 1|none | 5|acc |0.9874|± |0.0056| |
| | | | |none | 5|f1 |0.9874|± |N/A | |
| | |haerae |N/A |none | 5|acc |0.7965|± |0.0116| |
| | | | |none | 5|acc_norm|0.7965|± |0.0116| |
| | | - haerae_general_knowledge | 1|none | 5|acc |0.5114|± |0.0378| |
| | | | |none | 5|acc_norm|0.5114|± |0.0378| |
| | | - haerae_history | 1|none | 5|acc |0.8511|± |0.0260| |
| | | | |none | 5|acc_norm|0.8511|± |0.0260| |
| | | - haerae_loan_word | 1|none | 5|acc |0.8402|± |0.0283| |
| | | | |none | 5|acc_norm|0.8402|± |0.0283| |
| | | - haerae_rare_word | 1|none | 5|acc |0.8642|± |0.0170| |
| | | | |none | 5|acc_norm|0.8642|± |0.0170| |
| | | - haerae_standard_nomenclature| 1|none | 5|acc |0.8301|± |0.0305| |
| | | | |none | 5|acc_norm|0.8301|± |0.0305| |
| | |
| | ## LICENSE |
| | |
| | Follows Yi License |
| | |
| | ## Citation |
| | |
| | |
| | |
| | ## Acknowledgement |
| | |
| | The training is supported by [TPU Research Cloud](https://sites.research.google/trc/) program. |