---
language:
- en
- ja
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
model_type: mistral
---

# Swallow-MS-7b-v0.1

This model is [tokyotech-llm/Swallow-MS-7b-instruct-v0.1](https://huggingface.co/tokyotech-llm/Swallow-MS-7b-instruct-v0.1/commits/main) with its tokenizer.chat_template changed to the following:
```python
tokenizer.chat_template = """{% if messages[0]['role'] == 'system' %}
{% set loop_messages = messages[1:] %}
{% set system_message = messages[0]['content'] %}
{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}
{% set loop_messages = messages %}
{% set system_message = 'あなたは誠実で優秀な日本人のアシスタントです。' %}
{% else %}
{% set loop_messages = messages %}
{% set system_message = false %}
{% endif %}
{% if not (messages[0]['role'] == 'assistant' and loop_messages|length > 0) %}
{{ bos_token }}
{% endif %}
{% for message in loop_messages %}
{% if (message['role'] == 'user') != ((loop.index0 + (1 if messages[0]['role'] == 'assistant' else 0)) % 2 == 0) %}
{{ raise_exception('Conversation roles must alternate starting from the first role.') }}
{% endif %}
{% if loop.index0 == 0 and system_message != false %}
{% set content = '<<SYS>>\n' + system_message + '\n<</SYS>>\n\n' + message['content'] %}
{% else %}
{% set content = message['content'] %}
{% endif %}
{% if message['role'] == 'user' %}
{{ '[INST] ' + content.strip() + ' [/INST] ' }}
{% elif message['role'] == 'system' %}
{{ '<<SYS>>\n' + content.strip() + '\n<</SYS>>\n\n' }}
{% elif message['role'] == 'assistant' %}
{{ '' + content.strip() + '' + eos_token }}
{% endif %}
{% endfor %}"""
```
The revision of the original model is `8b17f1c87697fb354952fa0d1018568e50bdff56`.
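
As a quick way to inspect the modified template, the following minimal sketch renders a short conversation into the prompt string; `<this-repo-id>` is a placeholder for this repository's Hub ID, not an identifier taken from the card.

```python
from transformers import AutoTokenizer

# "<this-repo-id>" is a placeholder; replace it with this repository's Hub ID
# so that the modified chat template above is loaded together with the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("<this-repo-id>")

messages = [
    {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。"},
    {"role": "user", "content": "東京工業大学の主なキャンパスについて教えてください"},
]

# tokenize=False returns the rendered prompt string, which makes it easy to
# check the [INST] / <<SYS>> formatting produced by the template.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```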

Our Swallow-MS-7b-v0.1 model has undergone continual pre-training from Mistral-7B-v0.1, primarily with the addition of Japanese language data.

# Model Release Updates

We are excited to share the release schedule for our latest models:
- **April 26, 2024**: Released the [Swallow-MS-7b-instruct-v0.1](https://huggingface.co/tokyotech-llm/Swallow-MS-7b-instruct-v0.1)
- **March 11, 2024**: Released the [Swallow-MS-7b-v0.1](https://huggingface.co/tokyotech-llm/Swallow-MS-7b-v0.1)

This repository provides large language models developed by [TokyoTech-LLM](https://tokyotech-llm.github.io/).

## Model Details

* **Model type**: Please refer to the Mistral technical report for details on the model architecture.
* **Language(s)**: Japanese, English
* **Tokenizer**: This model employs a tokenizer whose vocabulary has been broadened using Japanese data. This allows text to be represented with fewer tokens, leading to notably faster inference; a short comparison sketch follows this list.
* **Contact**: swallow[at]nlp.c.titech.ac.jp

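To illustrate the tokenizer point above, the sketch below compares token counts on a short Japanese sentence. It assumes both repositories are reachable on the Hugging Face Hub; the exact counts depend on the text.

```python
from transformers import AutoTokenizer

# Compare how many tokens each tokenizer needs for the same Japanese sentence.
swallow_tokenizer = AutoTokenizer.from_pretrained("tokyotech-llm/Swallow-MS-7b-v0.1")
mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "東京工業大学の主なキャンパスについて教えてください"
print("Swallow:", len(swallow_tokenizer.tokenize(text)))  # broadened Japanese vocabulary
print("Mistral:", len(mistral_tokenizer.tokenize(text)))  # base vocabulary, typically more tokens
```
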
## Instruct Model Performance

### MT-Bench JA

#### Turn-Wise Performance

We report the overall score (i.e., the average of the first- and second-turn scores) as well as the first-turn and second-turn scores.

##### Overall

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750|

##### First Turn

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3699|0.4880|0.4260|0.3900|0.1080|0.2364|0.3780|0.4500|0.4800|

##### Second Turn

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3130|0.2624|0.4320|0.2996|0.1000|0.2430|0.3564|0.3291|0.4700|

#### Comparison with other models

We report only the overall (turn-averaged) scores in this section.

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750|
| ELYZA-japanese-Llama-2-7b-fast-instruct |0.2827|0.3289|0.3907|0.2424|0.1480|0.1584|0.3511|0.3053|0.3365|
| calm2-7b-chat |0.3204|0.4657|0.4898|0.1837|0.1005|0.1414|0.3927|0.3601|0.4293|
| calm2-7b-chat-dpo-experimental |0.3493|0.5312|0.5237|0.1857|0.1000|0.1813|0.3355|0.4320|0.5051|
| RakutenAI-7B-instruct |0.2994|0.3623|0.3711|0.3333|0.1763|0.1581|0.4215|0.2824|0.2901|
| RakutenAI-7B-chat |0.3667|0.4229|0.4644|0.3990|0.2161|0.2390|0.3416|0.3904|0.4601|

## Evaluation Benchmarks

### MT-Bench JA

We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
We utilized the following settings:

- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
- Judge: `gpt-4-1106-preview`
- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs (see the sketch after this list).

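As a concrete reading of the scoring setup, the sketch below assumes the judge assigns scores on a 10-point scale (as in FastChat's MT-Bench) and shows how per-run averages would map to the reported 0-1 range; the numbers are placeholders, not measured values.

```python
# Placeholder per-run averages on the judge's 10-point scale (not real results).
run_scores = [3.4, 3.5, 3.3, 3.5, 3.4]

# Normalize each run to the 0-1 range, then average over the five runs.
normalized = [score / 10 for score in run_scores]
overall = sum(normalized) / len(normalized)
print(round(overall, 4))
```
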
## Usage

First, install the additional dependencies listed in [requirements.txt](./requirements.txt):

```sh
pip install -r requirements.txt
```

### Instruction format Ver0.1

This format must be adhered to strictly, as deviations may result in less optimal outputs from the model.

The template used to construct a prompt for the Instruct model is specified as follows:

```
<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{USER_MESSAGE_1} [/INST] {BOT_MESSAGE_1} </s>[INST] {USER_MESSAGE_2} [/INST]
```

Please be aware that ``<s>`` and ``</s>`` are special tokens used for the beginning of string (BOS) and end of string (EOS), respectively, while [INST] and [/INST] are regular strings.

For the "{SYSTEM_PROMPT}" part, we recommend using "あなたは誠実で優秀な日本人のアシスタントです。" ("You are a sincere and excellent Japanese assistant.").

For the "{USER_MESSAGE_1}" part, we recommend using {instruction}\n{input}.

In other words, we recommend the following:

```
<s>[INST] <<SYS>>\nあなたは誠実で優秀な日本人のアシスタントです。\n<</SYS>>\n\n{instruction1}\n{input1} [/INST] {BOT_MESSAGE_1} </s>[INST] {instruction2}\n{input2} [/INST]
```

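To make the format above concrete, here is a minimal sketch that assembles a single-turn prompt by hand. The instruction and input values are hypothetical examples, and `<s>` (BOS) is left for the tokenizer to add rather than written into the string.

```python
# Assemble a single-turn Ver0.1 prompt following the format above.
SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。"
instruction = "以下の質問に簡潔に答えてください。"  # hypothetical example
user_input = "東京工業大学の主なキャンパスについて教えてください"  # hypothetical example

user_message = f"{instruction}\n{user_input}"
prompt = f"[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_message} [/INST] "

# <s> is the BOS special token, so let the tokenizer prepend it instead of
# writing the literal string "<s>" into the prompt:
# input_ids = tokenizer(prompt, add_special_tokens=True, return_tensors="pt").input_ids
```
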
### Use the instruct model Ver0.1

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-MS-7b-instruct-v0.1"
# device_map="auto" places the model on the available GPU(s) automatically.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。"},
    {"role": "user", "content": "東京工業大学の主なキャンパスについて教えてください"}
]

# apply_chat_template renders the conversation with the chat template and
# returns the input IDs as a tensor.
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(model.device)

generated_ids = model.generate(model_inputs, max_new_tokens=128, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```

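Note that `decoded[0]` above contains the prompt followed by the generated reply. If only the newly generated text is wanted, a small variation (continuing from the variables above) is:

```python
# Keep only the tokens generated after the prompt, then decode them.
new_tokens = generated_ids[0, model_inputs.shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
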
## Training Datasets

### Instruction Tuning Ver0.1

The following datasets were used for instruction tuning.

- [OpenAssistant Conversations Dataset](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja): only the human utterances are used; the original responses are discarded and replaced with responses generated by the [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model.
- [OpenAssistant Conversations Dataset 21k Ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)
- [OpenAssistant Conversations Dataset 21k En](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)
- [Databricks Dolly 15k Ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)
- [Databricks Dolly 15k En](https://huggingface.co/datasets/databricks/databricks-dolly-15k)

Please note that some of the data had quality or formatting issues, so not all of it was used.

## Risks and Limitations

The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

## Acknowledgements

We thank Mistral AI for releasing Mistral 7B v0.1 under an open license for others to build on.

Our project is supported by the [ABCI Large-scale Language Model Building Support Program](https://abci.ai/en/link/llm_support_program.html) of the National Institute of Advanced Industrial Science and Technology.

## License

apache-2.0

## Authors

Here are the team members:
- From [Okazaki Laboratory](https://www.nlp.c.titech.ac.jp/index.en.html), the following members:
  - [Naoaki Okazaki](https://www.chokkan.org/index.ja.html)
  - [Sakae Mizuki](https://s-mizuki-nlp.github.io/)
  - [Hiroki Iida](https://meshidenn.github.io/)
  - [Mengsay Loem](https://loem-ms.github.io/)
  - [Shota Hirai](https://huggingface.co/Kotemo428)
  - [Kakeru Hattori](https://aya-se.vercel.app/)
  - [Masanari Ohi](https://twitter.com/stjohn2007)
- From [YOKOTA Laboratory](https://www.rio.gsic.titech.ac.jp/en/index.html), the following members:
  - [Rio Yokota](https://twitter.com/rioyokota)
  - [Kazuki Fujii](https://twitter.com/okoge_kaz)
  - [Taishi Nakamura](https://twitter.com/Setuna7777_2)
  - [Takumi Okamoto](https://www.linkedin.com/in/takumi-okamoto)
  - [Ishida Shigeki](https://www.wantedly.com/id/reborn27)