---
language:
- ja
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- merge
- mergekit
- qwen2
- qwen2.5
- japanese
- reasoning
- slerp
- conversational
license: other
base_model:
- SakanaAI/TinySwallow-1.5B-Instruct
- WeiboAI/VibeThinker-1.5B
---

# Sakura

`summerMC/Sakura` is an experimental, Japanese-oriented merged language model created from:

- `SakanaAI/TinySwallow-1.5B-Instruct`
- `WeiboAI/VibeThinker-1.5B`

The goal of Sakura is to preserve the Japanese instruction-following and conversational behavior of TinySwallow while blending in a small amount of VibeThinker's reasoning behavior.

This model was created with `mergekit` using SLERP weight merging. The setting recommended by the initial Colab search is:

```yaml
merge_method: slerp
parameters:
  t: 0.05
```

## Model Summary

Sakura is an experimental merged model in the 1.5B-parameter class.

The primary model is `SakanaAI/TinySwallow-1.5B-Instruct`, which provides the Japanese instruction-following and conversational behavior.

The secondary donor model is `WeiboAI/VibeThinker-1.5B`, which contributes a small amount of reasoning-oriented behavior. Because VibeThinker is strongly oriented toward math and algorithmic reasoning, its contribution is intentionally kept low.

## Intended Use

This model is intended for:

- Japanese chat and instruction following
- Lightweight Japanese Q&A
- Simple reasoning tasks
- Simple mathematical explanations
- Basic Python/code-generation prompts
- Experimental research on small-model weight merging

This model is not intended for:

- Production or mission-critical use
- Medical, legal, financial, or safety-critical decision making
- Guaranteed factual answering
- High-stakes reasoning
- Unsupervised deployment without evaluation
- Replacing the original parent models

## Merge Details

### Parent Models

| Role | Model |
|---|---|
| Primary Japanese instruction model | `SakanaAI/TinySwallow-1.5B-Instruct` |
| Reasoning donor model | `WeiboAI/VibeThinker-1.5B` |

### Architecture Compatibility

Both parent models are Qwen2-family causal language models and were checked before merging.

Observed compatibility values:

| Field | Value |
|---|---|
| `model_type` | `qwen2` |
| `architectures` | `Qwen2ForCausalLM` |
| `hidden_size` | `1536` |
| `num_hidden_layers` | `28` |
| `num_attention_heads` | `12` |
| `num_key_value_heads` | `2` |
| `intermediate_size` | `8960` |
| `vocab_size` | `151936` |

The tokenizer and chat template were taken from `SakanaAI/TinySwallow-1.5B-Instruct`.

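
A compatibility check of this kind can be sketched as a field-by-field comparison of the two configs. The snippet below is a minimal illustration, not the procedure actually used for Sakura: the helper name `find_mismatches` and the two demo dictionaries are hypothetical, with values copied from the table above.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Fields listed in the compatibility table above.
FIELDS = (
    "model_type",
    "architectures",
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "intermediate_size",
    "vocab_size",
)


def find_mismatches(
    cfg_a: Dict[str, Any],
    cfg_b: Dict[str, Any],
    fields: Iterable[str] = FIELDS,
) -> List[Tuple[str, Any, Any]]:
    """Return (field, value_a, value_b) for every field whose values differ."""
    mismatches = []
    for name in fields:
        value_a, value_b = cfg_a.get(name), cfg_b.get(name)
        if value_a != value_b:
            mismatches.append((name, value_a, value_b))
    return mismatches


# Hypothetical stand-ins for the two parent configs (values from the table).
primary_cfg = {
    "model_type": "qwen2",
    "architectures": ["Qwen2ForCausalLM"],
    "hidden_size": 1536,
    "num_hidden_layers": 28,
    "num_attention_heads": 12,
    "num_key_value_heads": 2,
    "intermediate_size": 8960,
    "vocab_size": 151936,
}
donor_cfg = dict(primary_cfg)

# An empty list means the listed fields agree.
print(find_mismatches(primary_cfg, donor_cfg))
```

In practice the dictionaries could be obtained from the real checkpoints, for example with `AutoConfig.from_pretrained(model_id).to_dict()` from `transformers`. Matching fields are a necessary condition for a tensor-wise merge, but they do not guarantee that the merged model behaves well.
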
### Merge Method

The model was merged with `mergekit` using SLERP.

A low VibeThinker ratio was selected because, in early experiments, higher ratios degraded Japanese instruction-following and produced repetitive English output. With `base_model` set to TinySwallow, a small `t` keeps the merged weights close to TinySwallow.

Recommended merge setting:

```yaml
slices:
  - sources:
      - model: SakanaAI/TinySwallow-1.5B-Instruct
        layer_range: [0, 28]
      - model: WeiboAI/VibeThinker-1.5B
        layer_range: [0, 28]
merge_method: slerp
base_model: SakanaAI/TinySwallow-1.5B-Instruct
parameters:
  t: 0.05
dtype: bfloat16
tokenizer:
  source: SakanaAI/TinySwallow-1.5B-Instruct
chat_template: auto
```

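
To make the role of `t` concrete, the following is a minimal pure-Python sketch of spherical linear interpolation on flat vectors. It is only an illustration of the technique: `mergekit` applies SLERP to full weight tensors, and its implementation may differ in details such as normalization and the near-parallel fallback. The function name `slerp` and the toy vectors are assumptions made for this example.

```python
import math
from typing import List, Sequence


def slerp(t: float, a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Spherically interpolate from a (t = 0) to b (t = 1)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        # A zero vector has no direction: fall back to linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    if sin_omega < eps:
        # Nearly parallel (or anti-parallel) vectors: use linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy example: "primary" and "donor" weights as orthogonal unit vectors.
primary = [1.0, 0.0]
donor = [0.0, 1.0]
print(slerp(0.05, primary, donor))  # stays close to the primary vector
```

At `t = 0` the result equals the first vector and at `t = 1` the second, so `t: 0.05` with TinySwallow as the base model yields weights that remain dominated by TinySwallow.
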
## Colab / mergekit Command

```bash
mergekit-yaml merge_config.yaml ./Sakura \
  --cuda \
  --copy-tokenizer \
  --out-shard-size 2B \
  --trust-remote-code
```

After merging, the config should be patched to avoid tied-weight warnings:

```python
import json
from pathlib import Path

model_path = Path("./Sakura")
config_path = model_path / "config.json"

# Load the config written by mergekit.
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# Mark the input and output embeddings as untied.
config["tie_word_embeddings"] = False

# Write the patched config back in place.
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)
```

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/Sakura"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Fall back to EOS as the padding token if none is defined.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prompt: "Please explain concisely in Japanese. What is model merging?"
messages = [
    {
        "role": "user",
        "content": "日本語で簡潔に説明してください。モデルマージとは何ですか?",
    }
]

# Render the chat template to text, then tokenize it.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(
    text,
    return_tensors="pt",
).to(model.device)

# Collect every known end-of-turn token ID so generation stops on any of them.
eos_ids = []
for token in [tokenizer.eos_token, "<|im_end|>", "<|endoftext|>"]:
    if token is None:
        continue
    token_id = tokenizer.convert_tokens_to_ids(token)
    if token_id is not None and token_id != tokenizer.unk_token_id and token_id not in eos_ids:
        eos_ids.append(token_id)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        repetition_penalty=1.08,
        no_repeat_ngram_size=6,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=eos_ids if eos_ids else tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the prompt.
generated_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True).strip())
```

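## Example Prompts

Japanese explanation ("Please explain concisely in Japanese. What is model merging?"):

```text
日本語で簡潔に説明してください。モデルマージとは何ですか?
```

Simple arithmetic ("Divide 12 apples equally among 3 people. How many does each person get? Show your working as well."):

```text
12個のリンゴを3人で同じ数ずつ分けます。1人何個ですか?途中式も書いてください。
```

Python code generation ("Write a Python function that returns the first n Fibonacci numbers. Keep the explanation short."):

```text
Pythonでフィボナッチ数列をn個返す関数を書いてください。説明は短くしてください。
```

English-to-Japanese translation ("Rewrite the following sentence in natural Japanese: ..."):

```text
次の文章を自然な日本語に直してください: I went to the store because I needed some milk.
```
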
## Initial Evaluation

A lightweight manual evaluation was performed in Google Colab using the following prompt categories:

- Japanese explanation
- Simple arithmetic
- Python Fibonacci function generation
- English-to-Japanese translation

The best early candidate was around:

```text
SLERP t = 0.05
```

The evaluation was heuristic and should not be treated as a formal benchmark. More robust evaluation is recommended before publishing or using the model.

Suggested future evaluations:

- Japanese MT-Bench style prompts
- Japanese instruction-following tests
- GSM8K or Japanese arithmetic prompts
- HumanEval-style Python tasks
- Repetition and language-mixing checks
- Safety and refusal behavior tests

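
The repetition and language-mixing checks in the list above could start from simple text heuristics. The sketch below is one possible approach, not the evaluation used for this model: both function names are hypothetical, the n-gram statistic is character-level (unlike the token-level `no_repeat_ngram_size` used at decoding time), and the Japanese detection relies on Unicode ranges that also cover Chinese ideographs.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 6) -> float:
    """Fraction of character n-grams that repeat an n-gram already seen."""
    if len(text) < n:
        return 0.0
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def japanese_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in kana or CJK ideograph ranges."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c: str) -> bool:
        code = ord(c)
        # Hiragana and katakana: U+3040..U+30FF; CJK ideographs: U+4E00..U+9FFF.
        return 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF

    return sum(is_japanese(c) for c in chars) / len(chars)


# Toy usage on two hand-written strings (not real model outputs).
looping = "I am an assistant. " * 5
print(repeated_ngram_ratio(looping), japanese_char_ratio(looping))
print(japanese_char_ratio("モデルマージとは何ですか"))
```

A high repetition ratio or a low Japanese ratio on a Japanese prompt would flag an output for manual review. Any thresholds would need to be tuned on real outputs.
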
## Known Limitations

This is an experimental merge and may:

- Hallucinate facts
- Produce incorrect reasoning
- Mix English and Japanese
- Fail on complex mathematical tasks
- Produce repetitive output under some decoding settings
- Inherit limitations and biases from both parent models
- Underperform the original VibeThinker on English math/code benchmarks
- Underperform the original TinySwallow on some Japanese-only tasks

The model should be evaluated carefully before any downstream use.

## Why the VibeThinker Ratio Is Low

Early experiments with higher VibeThinker ratios caused unstable behavior, including:

- Loss of Japanese response behavior
- Repeated English assistant-style text
- Incorrect simple arithmetic
- Excessive repetition

For this reason, the recommended starting range is:

```text
t = 0.03 to 0.08
```

The initial recommended value is:

```text
t = 0.05
```

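
To explore the recommended range, one option is to generate one merge config per candidate `t` and run `mergekit-yaml` on each. The sketch below only builds the config texts. The file-name pattern and the helper name `sweep_configs` are hypothetical, and the candidate values are illustrative points inside the range above.

```python
from typing import Dict, Iterable

# Same structure as the recommended merge setting, with t left as a placeholder.
CONFIG_TEMPLATE = """\
slices:
  - sources:
      - model: SakanaAI/TinySwallow-1.5B-Instruct
        layer_range: [0, 28]
      - model: WeiboAI/VibeThinker-1.5B
        layer_range: [0, 28]
merge_method: slerp
base_model: SakanaAI/TinySwallow-1.5B-Instruct
parameters:
  t: {t}
dtype: bfloat16
tokenizer:
  source: SakanaAI/TinySwallow-1.5B-Instruct
chat_template: auto
"""


def sweep_configs(t_values: Iterable[float]) -> Dict[str, str]:
    """Map a (hypothetical) config file name to its YAML text for each t."""
    return {
        f"merge_t{t:.2f}.yaml": CONFIG_TEMPLATE.format(t=t)
        for t in t_values
    }


configs = sweep_configs([0.03, 0.05, 0.08])
for name in configs:
    print(name)
```

Each text can then be written to disk and passed to `mergekit-yaml` in place of `merge_config.yaml`, with a separate output directory per candidate.
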
## Reproducibility

Minimal reproducible merge config:

```yaml
slices:
  - sources:
      - model: SakanaAI/TinySwallow-1.5B-Instruct
        layer_range: [0, 28]
      - model: WeiboAI/VibeThinker-1.5B
        layer_range: [0, 28]
merge_method: slerp
base_model: SakanaAI/TinySwallow-1.5B-Instruct
parameters:
  t: 0.05
dtype: bfloat16
tokenizer:
  source: SakanaAI/TinySwallow-1.5B-Instruct
chat_template: auto
```

## Recommended Generation Settings

For Japanese instruction-following:

```python
generation_config = {
    "max_new_tokens": 256,
    "do_sample": False,
    "repetition_penalty": 1.08,
    "no_repeat_ngram_size": 6,
}
```

For more creative or reasoning-oriented outputs:

```python
generation_config = {
    "max_new_tokens": 512,
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "repetition_penalty": 1.05,
}
```

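
One way to use these dictionaries is to keep them as named presets and unpack the chosen one into `model.generate`. The sketch below is a small convenience wrapper, not part of the model or of `transformers`: the preset names `instruct` and `creative` and the helper `generation_kwargs` are hypothetical.

```python
from typing import Any, Dict

# Presets copied from the two settings above; the preset names are hypothetical.
PRESETS: Dict[str, Dict[str, Any]] = {
    "instruct": {
        "max_new_tokens": 256,
        "do_sample": False,
        "repetition_penalty": 1.08,
        "no_repeat_ngram_size": 6,
    },
    "creative": {
        "max_new_tokens": 512,
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.95,
        "repetition_penalty": 1.05,
    },
}


def generation_kwargs(preset: str, **overrides: Any) -> Dict[str, Any]:
    """Return a copy of a preset with optional per-call overrides applied."""
    kwargs = dict(PRESETS[preset])
    kwargs.update(overrides)
    return kwargs


# Intended use with the `model` and `inputs` objects from the Usage section:
# output_ids = model.generate(**inputs, **generation_kwargs("instruct"))
print(generation_kwargs("creative", max_new_tokens=128))
```

Returning a copy keeps the presets unchanged between calls. Stop-token settings such as `pad_token_id` and `eos_token_id` can be passed as overrides.
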
## License and Terms

This merged model is derived from the following parent models:

- `SakanaAI/TinySwallow-1.5B-Instruct`
- `WeiboAI/VibeThinker-1.5B`

Users must comply with the licenses, terms, and usage policies of all parent models and any upstream models or datasets referenced by those projects.

Please review the parent model cards and licenses before use or redistribution.

## Acknowledgements

This model is based on the work of:

- Sakana AI
- The Swallow / Japanese LLM community
- WeiboAI
- Qwen model developers
- mergekit developers

## Citation

If you use this merged model, please cite the parent models and relevant technical reports for TinySwallow, VibeThinker, Qwen2.5, and mergekit where appropriate.

## Disclaimer

This model is an experimental research artifact. It is provided without warranty. The authors of this merge are not responsible for outputs generated by the model or downstream uses of the model.