init commit

- .gitattributes +1 -0
- .gitignore +8 -0
- CAT-logo.png +3 -0
- LICENSE +21 -0
- README.md +143 -0
- TRAINING.md +121 -0
- config.json +30 -0
- model.safetensors +3 -0
- special_tokens_map.json +51 -0
- test.py +22 -0
- tokenizer.json +0 -0
- tokenizer.model +3 -0
- tokenizer_config.json +172 -0
.gitattributes
CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED

.mypy_cache/
__pycache__/
.ipynb_checkpoints/
env/
venv/
*.pyc
*.pyo
*.pyd
CAT-logo.png
ADDED

Git LFS Details
LICENSE
ADDED

MIT License

Copyright (c) 2026 CyberAgent AI Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
ADDED
# CAT-Translate 🐱

[License: MIT](https://opensource.org/licenses/MIT)
[Hugging Face](https://huggingface.co/cyberagent/CAT-Translate-0.8b/)

Tiny Language Model For Japanese and English Bidirectional Translation

- **Purrs on your lap** 🐱: Small and efficient! 0.8-3.3B models that run on edge devices.
- **Swift and Feline Sharp** 🐾: Beats TranslateGemma-12B on text-to-text translation quality.
- **Adopt and adapt** 🐈: Open source (MIT License) models you can customize and extend.

<div align="center">
<img src="CAT-logo.png" alt="Cat sleeping on top of a laptop." width="200">
</div>

## Models

All models are available on Hugging Face:

- [CAT-Translate-0.8B](https://huggingface.co/cyberagent/CAT-Translate-0.8b/)
- [CAT-Translate-1.4B](https://huggingface.co/cyberagent/CAT-Translate-1.4b/)
- [CAT-Translate-3.3B (in preparation)](https://huggingface.co/cyberagent/CAT-Translate-3.3b/)
## Evaluation

We evaluated on the translation subsets of the following benchmarks:

- [The Business Scene Dialogue corpus](https://github.com/tsuruoka-lab/BSD) (BSD)
  - Each conversation is given to the model as a whole, rather than sentence by sentence.
- [Court Interpreter](https://github.com/mynlp/court_interpreter) (Court)
- [JMedBench](https://huggingface.co/datasets/Coldog2333/JMedBench) (JMed)
  - The ejmmt subsets are used.
- [pfmt-bench-fin-ja](https://github.com/pfnet-research/pfmt-bench-fin-ja) (PFMT)
- [WAT 2025 Patent Translation](https://sites.google.com/view/pat-claims-trans-2025/) (wat-pat-2025)

We chose these tasks as benchmarks because (1) they are derived from real-world applications and (2) they are less prone to overoptimization than popular datasets (e.g., WMT).

The results are below.
Our 1.4B model achieved the best overall scores.
The 0.8B, 1.4B, and 3.3B-beta models achieved the best scores among all models (including closed-source ones) within their respective size classes, for both En-Ja and Ja-En translation.

| Model | Avg. BLEU | Avg. BLEU Ja->En | Avg. BLEU En->Ja | BSD (Ja-En) | Court (Ja-En) | JMed (Ja-En) | PFMT (Ja-En) | wat-pat-2025 (Ja-En) | BSD (En-Ja) | JMed (En-Ja) | PFMT (En-Ja) | wat-pat-2025 (En-Ja) |
|:-------------------------------------------------|----------:|-----------------:|-----------------:|------------:|--------------:|-------------:|-------------:|------------------:|------------:|-------------:|-------------:|------------------:|
| CyberAgent/CAT-Translate-1.4B | 33.73 | 33.26 | 34.19 | 31.28 | 43.84 | 24.08 | 36.55 | 30.57 | 15.71 | 26.92 | 51.53 | 42.58 |
| Unbabel/Tower-Plus-9B | 32.41 | 36.84 | 27.99 | 15.43 | 40.54 | 29.13 | 58.00 | 41.10 | 10.00 | 18.80 | 53.00 | 30.16 |
| google/translategemma-12b-it | 32.24 | 35.81 | 28.68 | 31.58 | 34.30 | 23.46 | 48.75 | 40.97 | 15.92 | 21.79 | 52.53 | 24.47 |
| CyberAgent/CAT-Translate-3.3B-beta | 30.60 | 30.32 | 30.88 | 17.20 | 38.65 | 23.96 | 40.58 | 31.22 | 16.63 | 26.68 | 53.40 | 26.80 |
| CyberAgent/CAT-Translate-0.8B | 30.42 | 29.71 | 30.68 | 29.63 | 33.19 | 22.96 | 32.51 | 30.56 | 14.60 | 26.22 | 50.62 | 32.87 |
| google/translategemma-4b-it | 28.09 | 29.41 | 26.76 | 28.86 | 25.89 | 21.50 | 42.65 | 28.16 | 14.14 | 20.68 | 51.99 | 20.23 |
| LiquidAI/LFM2.5-1.2B-JP | 25.47 | 24.51 | 26.43 | 19.06 | 29.99 | 22.10 | 43.61 | 7.80 | 14.57 | 23.85 | 54.77 | 12.54 |
| pfnet/plamo-2-translate | 25.24 | 25.92 | 24.57 | 25.55 | 28.63 | 22.90 | 29.02 | 23.48 | 17.35 | 24.98 | 32.04 | 23.89 |
| LiquidAI/LFM2-350M-ENJP-MT | 24.95 | 24.91 | 25.00 | 10.94 | 29.56 | 21.48 | 41.40 | 21.17 | 8.11 | 22.84 | 47.53 | 21.52 |
| mistralai/Ministral-8B-Instruct-2410 | 24.12 | 27.52 | 20.71 | 19.23 | 29.21 | 16.25 | 50.23 | 22.69 | 12.91 | 16.49 | 41.66 | 11.80 |
| Rakuten/RakutenAI-2.0-mini-instruct | 18.43 | 17.24 | 19.62 | 0.11 | 30.62 | 18.21 | 29.34 | 7.90 | 5.19 | 20.36 | 45.70 | 7.23 |
| SakanaAI/TinySwallow-1.5B-Instruct | 15.74 | 14.99 | 16.49 | 4.96 | 18.93 | 15.83 | 26.67 | 8.58 | 6.30 | 17.58 | 34.07 | 8.00 |
| llm-jp/llm-jp-3.1-1.8b-instruct4 | 15.18 | 16.26 | 14.11 | 18.82 | 2.44 | 15.67 | 30.65 | 13.72 | 15.38 | 4.91 | 25.47 | 10.65 |
| tencent/HY-MT1.5-1.8B | 14.49 | 8.95 | 20.04 | 5.50 | 4.59 | 4.00 | 15.67 | 14.98 | 6.33 | 18.13 | 37.75 | 17.96 |
| shisa-ai/shisa-v2.1-llama3.2-3b | 14.27 | 14.26 | 14.28 | 17.08 | 3.70 | 8.26 | 26.86 | 15.42 | 13.18 | 5.54 | 25.97 | 12.41 |
| google/gemma-2-2b-jpn-it | 14.15 | 16.98 | 11.32 | 20.04 | 8.08 | 11.27 | 31.49 | 14.01 | 12.37 | 4.48 | 16.24 | 12.21 |
| shisa-ai/shisa-v2.1-lfm2-1.2b | 13.08 | 14.02 | 12.14 | 20.93 | 4.95 | 7.68 | 26.72 | 9.80 | 12.11 | 5.54 | 17.60 | 13.30 |
| microsoft/phi-4 | 11.92 | 13.48 | 10.36 | 6.10 | 18.66 | 2.81 | 24.86 | 14.98 | 3.24 | 6.97 | 14.36 | 16.87 |
| tencent/HY-MT1.5-7B | 10.56 | 13.46 | 7.67 | 4.99 | 12.32 | 5.72 | 29.53 | 14.76 | 0.82 | 7.80 | 14.30 | 7.74 |
| tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5 | 10.35 | 12.42 | 8.28 | 24.25 | 2.30 | 3.69 | 14.11 | 17.74 | 6.82 | 2.37 | 11.21 | 12.71 |
| Qwen/Qwen2.5-14B-Instruct | 8.39 | 9.88 | 6.89 | 10.81 | 4.70 | 4.27 | 11.18 | 18.46 | 4.01 | 3.69 | 13.42 | 6.42 |
| meta-llama/Llama-3.2-3B-Instruct | 6.06 | 9.90 | 2.23 | 18.60 | 0.41 | 2.72 | 16.62 | 11.17 | 1.44 | 1.10 | 4.50 | 1.87 |

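Reading the table: the direction averages are the means over the five Ja→En and four En→Ja benchmarks, and Avg. BLEU appears to be the mean of those two direction averages (a macro average over directions, not over all nine benchmarks). A quick check against the CAT-Translate-1.4B row, noting that small differences arise because the published numbers were computed before rounding:

```python
# Per-benchmark BLEU scores for CyberAgent/CAT-Translate-1.4B, read off the table.
ja_en = [31.28, 43.84, 24.08, 36.55, 30.57]  # BSD, Court, JMed, PFMT, wat-pat-2025
en_ja = [15.71, 26.92, 51.53, 42.58]         # BSD, JMed, PFMT, wat-pat-2025

avg_ja_en = sum(ja_en) / len(ja_en)          # ~33.26, matches the table
avg_en_ja = sum(en_ja) / len(en_ja)          # ~34.19, matches the table
overall = (avg_ja_en + avg_en_ja) / 2        # ~33.73, matches Avg. BLEU
```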
A detailed experimental evaluation will be presented in a technical report.

## Usage

The model supports English to Japanese and Japanese to English translation with the following prompt format:

```python
from transformers import pipeline

# Load the model
chat_pipeline = pipeline("text-generation", model="CyberAgent/CAT-Translate-0.8b")

# Define the prompt template
prompt = "Translate the following {src_lang} text into {tgt_lang}.\n\n{src_text}"

# Example: Japanese to English
src_lang = "Japanese"
tgt_lang = "English"
src_text = "🐈はとてもかわいいの。おててがまるくてふわふわなの。"

user_input = [{"role": "user", "content": prompt.format(src_lang=src_lang, tgt_lang=tgt_lang, src_text=src_text)}]

response = chat_pipeline(user_input)

print("-" * 20)
print("Source Text:")
print(src_text)
print("Translation:")
print(response[0]["generated_text"][-1]["content"])
```

**Important**: You need to apply the chat template to run the model correctly. The template is the same as [sarashina2.2-0.5b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-0.5b-instruct-v0.1).
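For intuition, here is a simplified sketch of what that chat template produces for a plain conversation (tool calling omitted), based on the `chat_template` string shipped in this repo's `tokenizer_config.json`. The pipeline applies the real template automatically; this only illustrates the token layout the model expects.

```python
# Simplified re-implementation of the sarashina2.2 chat template for
# system/user/assistant turns only (the real Jinja template also handles tools).
EOS = "</s>"  # eos_token from tokenizer_config.json

def render_prompt(messages, add_generation_prompt=True):
    out = ""
    for m in messages:
        tag = {"system": "<|system|>", "user": "<|user|>",
               "assistant": "<|assistant|>"}[m["role"]]
        out += tag + m["content"] + EOS
    if add_generation_prompt:
        out += "<|assistant|>"  # cue the model to start its answer
    return out

rendered = render_prompt([{"role": "user", "content": "Translate ..."}])
# rendered == "<|user|>Translate ...</s><|assistant|>"
```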

### Why Use Instructions?

Although the model is specialized for machine translation, an instruction prompt is required to invoke its translation capability. This design choice provides better customizability: it makes the model easier to extend and merge with other models. Since the model is open source, any extensions are welcome!

## Training

We used the [sarashina2.2 series](https://huggingface.co/collections/sbintuitions/sarashina22) ([MIT LICENSE](https://huggingface.co/sbintuitions/sarashina2.2-0.5b/blob/main/LICENSE)) as our pretrained model. While Qwen-3 showed higher benchmark scores, we found that sarashina generated more natural Japanese text that avoided "translationese" patterns. We hypothesized that naturalness is more difficult to learn than translation accuracy, leading us to choose sarashina as our base model.

Our training process involved:

- Synthesizing parallel corpora from monolingual data using large language models
- A two-stage supervised fine-tuning (SFT) approach
- Reinforcement learning with [Multi-Objective GRPO (Ichihara et al., 2025)](https://arxiv.org/abs/2509.22047)
- LoRA for efficient training

For detailed information about our training methodology, data preparation, and technical specifications, please see [TRAINING.md](TRAINING.md).

## License

The model is licensed under the [MIT License](LICENSE).

## Citation

```bibtex
@misc{cat-translate-2026,
  title={CAT-Translate: Tiny Language Model For Japanese and English Bidirectional Translation},
  author={Yuu Jinnai},
  year={2026},
  url={https://huggingface.co/cyberagent/CAT-Translate-0.8b}
}
```

## Acknowledgments

This project stands on the shoulders of giants. In particular, the following resources significantly helped us develop the model:

- [sarashina](https://huggingface.co/sbintuitions) by SB Intuitions
- [gpt-oss](https://huggingface.co/openai/gpt-oss-20b) by OpenAI
- [MetricX](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6-bfloat16) by Juraj Juraska et al.
- [Duplodocus](https://github.com/allenai/duplodocus) by AllenAI
- [fastText](https://github.com/facebookresearch/fastText) by Facebook Research
- [COMET](https://huggingface.co/Unbabel/wmt22-comet-da) by Ricardo Rei et al.
- [sacrebleu](https://github.com/mjpost/sacrebleu) by Matt Post
- Mitsuki Sakamoto for deploying the model with a UI for internal testing
TRAINING.md
ADDED
# Training Details

This document provides detailed information about the training methodology used to develop the CAT-Translate models.
The details will be available in a technical report.

## Table of Contents

- [Training Data](#training-data)
- [Supervised Fine-Tuning](#supervised-fine-tuning)
- [Reinforcement Learning](#reinforcement-learning)
- [LoRA Configuration](#lora-configuration)

## Training Data

We synthesized parallel corpora from monolingual data using large language models. For generating translations, we used:

- **DeepSeek-V3**: Used for initial prototyping only; not used for the rest of development.
- **gpt-oss-20b**: Generated most of the data, providing sufficiently high quality for many instances.
- **gpt-oss-120b**: Used for domains where gpt-oss-20b was not satisfactory (e.g., scientific abstracts).

### Data Filtering

The synthesized data were filtered to remove instances matching any of the following criteria:

- Texts written mostly in languages other than Japanese or English
- Japanese-to-English text length ratios that are too large or too small
- Duplicated content, detected with MinHash ([Duplodocus](https://github.com/allenai/duplodocus))
- Low quality according to BLEU score and/or COMET score ([comet-qe](https://huggingface.co/Unbabel/wmt22-comet-da))
- Manually identified low-quality texts
- Texts caught by hand-written, rule-based filters for low-quality patterns identified by hand
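As an illustration of the length-ratio criterion above, a minimal filter might look like the following. The character-based ratio and the thresholds are illustrative assumptions, not the values used in training.

```python
# Hedged sketch of a length-ratio filter for Ja-En pairs.
# Thresholds lo/hi are placeholders; the actual values were not published.
def length_ratio_ok(ja_text: str, en_text: str,
                    lo: float = 0.3, hi: float = 3.0) -> bool:
    """Return True if the Japanese/English character-length ratio is plausible."""
    if not ja_text or not en_text:
        return False  # empty sides are dropped outright
    ratio = len(ja_text) / len(en_text)
    return lo <= ratio <= hi
```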

## Supervised Fine-Tuning

We applied a two-stage fine-tuning approach.

### First Stage: Focus on Diversity

The first stage focused on diversity of prompts. The dataset consisted of:

- Mostly web-crawled data with relatively low-quality translations
- Some portion from targeted domains, including:
  - Scientific abstracts (arXiv and PubMed)
  - Patents (USPTO)
- Mostly sentence-long instances, with some paragraph-long instances

**Key Finding**: We found that model performance mostly saturated on this corpus at around 100k instances. This led us to prepare a more challenging, higher-quality dataset for the second stage.

### Second Stage: Focus on Quality

The second stage focused on the quality of generated translations. Key characteristics:

- A large portion of data instances generated by **gpt-oss-120b**
- Focus areas:
  - Scientific abstracts (arXiv and PubMed)
  - Patents (USPTO)
  - Underspecified/misspecified text (e.g., inputs with typos)
- Most instances were paragraph-long to multiple paragraphs long
- Some data from the first-stage corpus was kept to maintain diversity

## Reinforcement Learning

We used the same corpus as the second stage of SFT. The model was trained with **Multi-Objective GRPO** (Ichihara et al., 2025).

### Primary Reward Model: MetricX-24

We chose [MetricX-24](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6-bfloat16) as our primary reward model for the following reasons:

- Open source
- Faster than LLM-as-a-Judge models
- High agreement with human judgments

We also considered using gpt-oss-120b as a judge, which has very high accuracy. However, it requires significantly more computational resources than were available under our constraints.

### MetricX Limitations

Like all reward models, MetricX has several misspecifications that generation models may exploit:

1. **Language-agnostic**: Being multilingual, it gives scores regardless of output language, even when the task requires generating Japanese text.
2. **Format-agnostic**: Syntactic characters such as newlines (`\n`) and markdown syntax (e.g., `*`, `#`) are ignored.
3. **Allows hallucination**: MetricX is relatively tolerant of hallucination as long as the output text contains the information in the input text. This is not ideal for training language-model-based machine translation systems.

### Auxiliary Reward Functions

To remedy these problems, we implemented auxiliary reward functions:

#### 1. BLEU Score (Weight: 0.1)

Used to compute lexical overlap with the reference text. Expected to be effective for:

- Avoiding overoptimization against the primary reward model
- Rewarding accurate translation of technical terms

#### 2. Format Consistency

Outputs whose format differs too much from the input are penalized. This addresses the issue where models often generate markdown-formatted text even when the input is plain text.

#### 3. Length Penalty

Outputs that are too long or too short are penalized. This suppressed many hallucinations generated by the models.

### Reward Normalization Strategy

- **MetricX and BLEU** (weights 1.0 and 0.1): Applied with normalization to compute advantages.
  - **Rationale**: Translation quality is difficult to learn, so training with relative, group-normalized advantages is appropriate.
- **Format consistency and length penalty**: Applied absolutely, without normalization (as in Dr. GRPO).
  - **Rationale**: These are easy to learn on their own and exist to keep the model from getting out of control. The penalties should be large enough to prevent violations regardless of any improvement in translation quality, and the model should be able to learn them. Thus, we penalize with a large absolute value rather than a relative one.
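The strategy above can be sketched as follows. The group-normalization follows the standard GRPO advantage formula; whether the 1.0/0.1 weights are applied before or after normalization, and the penalty magnitudes, are our assumptions for illustration.

```python
from statistics import mean, pstdev

def advantages(rewards, eps=1e-6):
    # GRPO-style group normalization: center and scale within one sample group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def combined_rewards(metricx, bleu, format_pen, length_pen):
    # MetricX (weight 1.0) and BLEU (weight 0.1): normalized, relative signal.
    relative = [1.0 * a + 0.1 * b
                for a, b in zip(advantages(metricx), advantages(bleu))]
    # Format/length penalties: applied absolutely, without normalization.
    return [rel - f - l for rel, f, l in zip(relative, format_pen, length_pen)]
```

A large absolute penalty (here subtracted directly) dominates any relative advantage, which is the stated intent: format and length violations are suppressed regardless of translation quality.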

## LoRA Configuration

We used LoRA (Low-Rank Adaptation) to reduce computational resource requirements.

| Model Size | LoRA Usage |
|------------|-----------|
| 0.5B | Not used |
| 1B | For GRPO |
| 3B | For the second stage of SFT and GRPO |
| 7B | For all processes |
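To see why LoRA saves resources here: a rank-r adapter on a d_out × d_in weight trains r·(d_in + d_out) parameters instead of d_in·d_out. The dimension below is the `hidden_size` from this repo's `config.json`; the rank r=16 is an illustrative assumption, not the value used in training.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A rank-r adapter adds B (d_out x r) and A (r x d_in).
    return r * (d_in + d_out)

d = 1280                          # hidden_size of the 0.8B model (config.json)
full = d * d                      # one full square projection: 1,638,400 weights
adapter = lora_params(d, d, 16)   # 40,960 trainable weights, ~2.5% of full
```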

---

For the main project overview, see [README.md](README.md).
config.json
ADDED

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 80,
  "hidden_act": "silu",
  "hidden_size": 1280,
  "initializer_range": 0.02,
  "intermediate_size": 4480,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 8,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": false,
  "vocab_size": 102400
}
model.safetensors
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:3375cdd5cb11dbc5036df4902cf20afbc5ffa560f51bb138bd4ffa8e8c0b10f9
size 1586121792
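The config and checkpoint above are mutually consistent. Assuming a standard Llama block without biases (`attention_bias` and `mlp_bias` are false in config.json) and untied embeddings, the parameter count works out to about 793M, i.e. the "0.8b" in the model name, and matches the file size at 2 bytes per bfloat16 parameter plus a small safetensors header:

```python
# Parameter count implied by config.json for the 0.8B model.
vocab, d, layers = 102400, 1280, 24
heads, kv_heads, head_dim, d_ff = 16, 8, 80, 4480

attn = d * heads * head_dim            # q_proj
attn += 2 * d * kv_heads * head_dim    # k_proj, v_proj (grouped-query attention)
attn += heads * head_dim * d           # o_proj
mlp = 3 * d * d_ff                     # gate, up, down projections
norms = 2 * d                          # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

# embeddings + untied lm_head + transformer layers + final norm
total = 2 * vocab * d + layers * per_layer + d
print(total)  # 793048320 -> ~0.79B parameters
# model.safetensors is 1,586,121,792 bytes ~= 2 * total + header
```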
special_tokens_map.json
ADDED

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<cls>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "<sep>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
test.py
ADDED

from transformers import pipeline

# model_name = "CyberAgent/CAT-Translate-0.8b"
# chat_pipeline = pipeline("text-generation", model_name)

chat_pipeline = pipeline("text-generation", model=".")

prompt = "Translate the following {src_lang} text into {tgt_lang}.\n\n{src_text}"

src_lang = "Japanese"
tgt_lang = "English"
src_text = "🐈はとてもかわいいの。おててがまるくてふわふわなの。"

user_input = [{"role": "user", "content": prompt.format(src_lang=src_lang, tgt_lang=tgt_lang, src_text=src_text)}]

response = chat_pipeline(user_input)

print("-" * 20)
print("Source Text:")
print(src_text)
print("Translation:")
print(response[0]["generated_text"][-1]["content"])
tokenizer.json
ADDED

(The diff for this file is too large to render.)
tokenizer.model
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:008293028e1a9d9a1038d9b63d989a2319797dfeaa03f171093a57b33a3a8277
size 1831879
tokenizer_config.json
ADDED

{
  "add_bos_token": false,
  "add_dummy_prefix_space": false,
  "add_eos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<mask>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<cls>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<|system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "8": {
      "content": "<|assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "9": {
      "content": "<|user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "10": {
      "content": "<|available_tools|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "11": {
      "content": "<|tool_calls|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "12": {
      "content": "<|tool_results|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "13": {
      "content": "<|code|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "14": {
      "content": "<|file|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "102397": {
      "content": "<|prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "102398": {
      "content": "<|suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "102399": {
      "content": "<|middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "bos_token": "<s>",
  "chat_template": "\n{%- set user_messages = messages | selectattr('role', 'equalto', 'user') | list %}\n{%- macro output_available_tools(tools, message) %}\n{%- if tools and (message == user_messages[-1]) %}\n    {{- '<|available_tools|>[' }}\n    {%- for tool in tools %}\n    {%- set tool = tool.function %}\n    {{- \"{\" }}\n    {%- for key, val in tool.items() if key != \"return\" %}\n    {%- if val is string %}\n    {{- \"'\" + key + \"': '\" + val + \"'\" }}\n    {%- else %}\n    {{- \"'\" + key + \"': \" + val|string }}\n    {%- endif %}\n    {%- if not loop.last %}\n    {{- \", \" }}\n    {%- endif %}\n    {%- endfor %}\n    {{- \"}\" }}\n    {%- if not loop.last %}\n    {{- \", \" }}\n    {%- else %}\n    {{- \"]\" }}\n    {%- endif %}\n    {%- endfor %}\n    {{- eos_token -}}\n{%- endif %}\n{%- endmacro %}\n\n{%- macro output_tool_results(tool_results) %}\n{{- '<|tool_results|>[' }}\n{%- for tool_result in tool_results %}\n    {{- \"{'content': \" + tool_result.content|string + \", 'call_id': '\" + tool_result.call_id + \"'}\" }}\n{%- endfor %}\n{{- ']' }}\n{{- eos_token -}}\n{%- endmacro %}\n\n{%- macro output_tool_calls(tool_calls) %}\n{{- '<|tool_calls|>[' }}\n{%- for tool_call in tool_calls %}\n    {{- \"{'id': '\" + tool_call.id + \"', 'name': '\" + tool_call.name + \"', 'arguments': \" + tool_call.arguments|string + '}' }}\n{%- endfor %}\n{{- ']' }}\n{%- endmacro %}\n\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n    {%- if tools is defined %}\n    {{- output_available_tools(tools, message) }}\n    {%- endif %}\n    {{- '<|user|>' + message['content'] + eos_token -}}\n    {%- elif message['role'] == 'system' %}\n    {{- '<|system|>' + message['content'] + eos_token -}}\n    {%- elif message['role'] == 'assistant' %}\n    {% set assistant_content = \"\" %}\n    {%- if message.content is defined %}\n    {% set assistant_content = message.content %}\n    {%- endif %}\n    {%- if message.tool_calls is defined and message.tool_calls -%}\n    {{- '<|assistant|>' + assistant_content + output_tool_calls(message['tool_calls']) + eos_token -}}\n    {%- else %}\n    {{- '<|assistant|>' + assistant_content + eos_token }}\n    {%- endif %}\n    {%- elif message['role'] == 'tool_results' %}\n    {{- output_tool_results(message.tool_results) }}\n    {%- endif %}\n{%- if loop.last and add_generation_prompt -%}\n    {{- '<|assistant|>' -}}\n{%- endif -%}\n{%- endfor %}\n",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<cls>",
  "do_lower_case": false,
  "eos_token": "</s>",
  "extra_ids": 0,
  "extra_special_tokens": {},
  "keep_accents": true,
  "legacy": false,
  "mask_token": "<mask>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "padding_side": "left",
  "sep_token": "<sep>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}