ddyuudd committed
Commit 59e7707 · 1 Parent(s): 5ed5631

init commit

Files changed (13)
  1. .gitattributes +1 -0
  2. .gitignore +8 -0
  3. CAT-logo.png +3 -0
  4. LICENSE +21 -0
  5. README.md +143 -0
  6. TRAINING.md +121 -0
  7. config.json +30 -0
  8. model.safetensors +3 -0
  9. special_tokens_map.json +51 -0
  10. test.py +22 -0
  11. tokenizer.json +0 -0
  12. tokenizer.model +3 -0
  13. tokenizer_config.json +172 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
.mypy_cache/
__pycache__/
.ipynb_checkpoints/
env/
venv/
*.pyc
*.pyo
*.pyd
CAT-logo.png ADDED

Git LFS Details

  • SHA256: 0c6536ff0db067fd5975e8d8667293ac4de306de737e8a3465cabd47efcb5f11
  • Pointer size: 132 Bytes
  • Size of remote file: 1.33 MB
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 CyberAgent AI Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md ADDED
@@ -0,0 +1,143 @@
# CAT-Translate 🐱

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/cyberagent/CAT-Translate-0.8b/)

Tiny Language Model For Japanese and English Bidirectional Translation

- **Purrs on your lap** 🐱: Small and efficient! 0.8-3.3B models that run on edge devices.
- **Swift and Feline Sharp** 🐾: Beats TranslateGemma-12B on text-to-text translation quality.
- **Adopt and adapt** 🐈: Open-source (MIT License) models you can customize and extend.

<div align="center">
<img src="CAT-logo.png" alt="Cat sleeping on top of a laptop." width="200">
</div>

## Models

All models are available on Hugging Face:

- [CAT-Translate-0.8B](https://huggingface.co/cyberagent/CAT-Translate-0.8b/)
- [CAT-Translate-1.4B](https://huggingface.co/cyberagent/CAT-Translate-1.4b/)
- [CAT-Translate-3.3B (in preparation)](https://huggingface.co/cyberagent/CAT-Translate-3.3b/)

## Evaluation

We evaluated the models on the translation subsets of the following benchmarks:

- [The Business Scene Dialogue corpus](https://github.com/tsuruoka-lab/BSD) (BSD)
  - Each conversation is given to the model as a whole rather than sentence by sentence.
- [Court Interpreter](https://github.com/mynlp/court_interpreter) (Court)
- [JMedBench](https://huggingface.co/datasets/Coldog2333/JMedBench) (JMed)
  - The ejmmt subsets are used.
- [pfmt-bench-fin-ja](https://github.com/pfnet-research/pfmt-bench-fin-ja) (PFMT)
- [WAT 2025 Patent Translation](https://sites.google.com/view/pat-claims-trans-2025/) (wat-pat-2025)

We chose these tasks as benchmarks because (1) they are derived from real-world applications and (2) they are less prone to overoptimization than popular datasets (e.g., WMT).

The results are below.
Overall, our 1.4B model achieved the best average scores.
The 0.8B, 1.4B, and 3.3B-beta models achieved the best scores among all models (including closed-source ones) within their respective sizes for both En-Ja and Ja-En translation tasks.

| Model | Avg. BLEU | Avg. BLEU Ja->En | Avg. BLEU En->Ja | BSD (Ja-En) | Court (Ja-En) | JMed (Ja-En) | PFMT (Ja-En) | wat-pat-2025 (Ja-En) | BSD (En-Ja) | JMed (En-Ja) | PFMT (En-Ja) | wat-pat-2025 (En-Ja) |
|:-------------------------------------------------|----------:|-----------------:|-----------------:|------------:|--------------:|-------------:|-------------:|------------------:|------------:|-------------:|-------------:|------------------:|
| CyberAgent/CAT-Translate-1.4B | 33.73 | 33.26 | 34.19 | 31.28 | 43.84 | 24.08 | 36.55 | 30.57 | 15.71 | 26.92 | 51.53 | 42.58 |
| Unbabel/Tower-Plus-9B | 32.41 | 36.84 | 27.99 | 15.43 | 40.54 | 29.13 | 58.00 | 41.10 | 10.00 | 18.80 | 53.00 | 30.16 |
| google/translategemma-12b-it | 32.24 | 35.81 | 28.68 | 31.58 | 34.30 | 23.46 | 48.75 | 40.97 | 15.92 | 21.79 | 52.53 | 24.47 |
| CyberAgent/CAT-Translate-3.3B-beta | 30.60 | 30.32 | 30.88 | 17.20 | 38.65 | 23.96 | 40.58 | 31.22 | 16.63 | 26.68 | 53.40 | 26.80 |
| CyberAgent/CAT-Translate-0.8B | 30.42 | 29.71 | 30.68 | 29.63 | 33.19 | 22.96 | 32.51 | 30.56 | 14.60 | 26.22 | 50.62 | 32.87 |
| google/translategemma-4b-it | 28.09 | 29.41 | 26.76 | 28.86 | 25.89 | 21.50 | 42.65 | 28.16 | 14.14 | 20.68 | 51.99 | 20.23 |
| LiquidAI/LFM2.5-1.2B-JP | 25.47 | 24.51 | 26.43 | 19.06 | 29.99 | 22.10 | 43.61 | 7.80 | 14.57 | 23.85 | 54.77 | 12.54 |
| pfnet/plamo-2-translate | 25.24 | 25.92 | 24.57 | 25.55 | 28.63 | 22.90 | 29.02 | 23.48 | 17.35 | 24.98 | 32.04 | 23.89 |
| LiquidAI/LFM2-350M-ENJP-MT | 24.95 | 24.91 | 25.00 | 10.94 | 29.56 | 21.48 | 41.40 | 21.17 | 8.11 | 22.84 | 47.53 | 21.52 |
| mistralai/Ministral-8B-Instruct-2410 | 24.12 | 27.52 | 20.71 | 19.23 | 29.21 | 16.25 | 50.23 | 22.69 | 12.91 | 16.49 | 41.66 | 11.80 |
| Rakuten/RakutenAI-2.0-mini-instruct | 18.43 | 17.24 | 19.62 | 0.11 | 30.62 | 18.21 | 29.34 | 7.90 | 5.19 | 20.36 | 45.70 | 7.23 |
| SakanaAI/TinySwallow-1.5B-Instruct | 15.74 | 14.99 | 16.49 | 4.96 | 18.93 | 15.83 | 26.67 | 8.58 | 6.30 | 17.58 | 34.07 | 8.00 |
| llm-jp/llm-jp-3.1-1.8b-instruct4 | 15.18 | 16.26 | 14.11 | 18.82 | 2.44 | 15.67 | 30.65 | 13.72 | 15.38 | 4.91 | 25.47 | 10.65 |
| tencent/HY-MT1.5-1.8B | 14.49 | 8.95 | 20.04 | 5.50 | 4.59 | 4.00 | 15.67 | 14.98 | 6.33 | 18.13 | 37.75 | 17.96 |
| shisa-ai/shisa-v2.1-llama3.2-3b | 14.27 | 14.26 | 14.28 | 17.08 | 3.70 | 8.26 | 26.86 | 15.42 | 13.18 | 5.54 | 25.97 | 12.41 |
| google/gemma-2-2b-jpn-it | 14.15 | 16.98 | 11.32 | 20.04 | 8.08 | 11.27 | 31.49 | 14.01 | 12.37 | 4.48 | 16.24 | 12.21 |
| shisa-ai/shisa-v2.1-lfm2-1.2b | 13.08 | 14.02 | 12.14 | 20.93 | 4.95 | 7.68 | 26.72 | 9.80 | 12.11 | 5.54 | 17.60 | 13.30 |
| microsoft/phi-4 | 11.92 | 13.48 | 10.36 | 6.10 | 18.66 | 2.81 | 24.86 | 14.98 | 3.24 | 6.97 | 14.36 | 16.87 |
| tencent/HY-MT1.5-7B | 10.56 | 13.46 | 7.67 | 4.99 | 12.32 | 5.72 | 29.53 | 14.76 | 0.82 | 7.80 | 14.30 | 7.74 |
| tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5 | 10.35 | 12.42 | 8.28 | 24.25 | 2.30 | 3.69 | 14.11 | 17.74 | 6.82 | 2.37 | 11.21 | 12.71 |
| Qwen/Qwen2.5-14B-Instruct | 8.39 | 9.88 | 6.89 | 10.81 | 4.70 | 4.27 | 11.18 | 18.46 | 4.01 | 3.69 | 13.42 | 6.42 |
| meta-llama/Llama-3.2-3B-Instruct | 6.06 | 9.90 | 2.23 | 18.60 | 0.41 | 2.72 | 16.62 | 11.17 | 1.44 | 1.10 | 4.50 | 1.87 |

A detailed experimental evaluation will be presented in a technical report.

## Usage

The model supports English-to-Japanese and Japanese-to-English translation with the following prompt format:

```python
from transformers import pipeline

# Load the model
chat_pipeline = pipeline("text-generation", model="CyberAgent/CAT-Translate-0.8b")

# Define the prompt template
prompt = "Translate the following {src_lang} text into {tgt_lang}.\n\n{src_text}"

# Example: Japanese to English
src_lang = "Japanese"
tgt_lang = "English"
src_text = "🐈はとてもかわいいの。おててがまるくてふわふわなの。"

user_input = [{"role": "user", "content": prompt.format(src_lang=src_lang, tgt_lang=tgt_lang, src_text=src_text)}]

response = chat_pipeline(user_input)

print("-" * 20)
print("Source Text:")
print(src_text)
print("Translation:")
print(response[0]['generated_text'][-1]['content'])
```

**Important**: You need to apply the chat template to run the model correctly. The template is the same as [sarashina2.2-0.5b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-0.5b-instruct-v0.1).
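For reference, the single-turn case of the template can be rendered by hand. This is a minimal sketch derived from the `chat_template` shipped in this repository's `tokenizer_config.json` (`<|user|>`/`<|assistant|>` markers, `</s>` as EOS); in practice, `tokenizer.apply_chat_template` or the `pipeline` above does this for you.

```python
# Minimal rendering of the chat template in tokenizer_config.json.
# Covers plain user/system/assistant turns only (no tool calls).
EOS = "</s>"

def render_chat(messages, add_generation_prompt=True):
    """Concatenate role markers, contents, and EOS tokens as the
    repository's Jinja chat template does for simple conversations."""
    out = []
    for m in messages:
        out.append(f"<|{m['role']}|>" + m["content"] + EOS)
    if add_generation_prompt:
        out.append("<|assistant|>")  # generation prompt for the model's reply
    return "".join(out)

text = render_chat([
    {"role": "user",
     "content": "Translate the following Japanese text into English.\n\nこんにちは"}
])
```

Feeding `text` to the model as a raw string should then be equivalent to passing the message list through the pipeline.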

### Why Use Instructions?

Although the model is specialized for machine translation, an instruction prompt is required to invoke its translation capability. This design choice provides better customizability: extending and merging the model is easier this way. Since the model is open source, any extensions are welcome!

## Training

We used the [sarashina2.2 series](https://huggingface.co/collections/sbintuitions/sarashina22) ([MIT LICENSE](https://huggingface.co/sbintuitions/sarashina2.2-0.5b/blob/main/LICENSE)) as our pretrained model. While Qwen-3 showed higher benchmark scores, we found that sarashina generated more natural Japanese text that avoided "translationese" patterns. We hypothesized that naturalness is more difficult to learn than translation accuracy, leading us to choose sarashina as our base model.

Our training process involved:
- Synthesizing parallel corpora from monolingual data using large language models
- A two-stage supervised fine-tuning (SFT) approach
- Reinforcement learning with [Multi-Objective GRPO (Ichihara et al. 2025)](https://arxiv.org/abs/2509.22047)
- LoRA for efficient training

For detailed information about our training methodology, data preparation, and technical specifications, please see [TRAINING.md](TRAINING.md).

## License

The model is licensed under the [MIT License](LICENSE).

## Citation

```bibtex
@misc{cat-translate-2026,
  title={CAT-Translate: Tiny Language Model For Japanese and English Bidirectional Translation},
  author={Yuu Jinnai},
  year={2026},
  url={https://huggingface.co/cyberagent/CAT-Translate-0.8b}
}
```

## Acknowledgments

This project stands on the shoulders of giants. In particular, the following resources significantly helped us develop the model:

- [sarashina](https://huggingface.co/sbintuitions) by SB Intuitions
- [gpt-oss](https://huggingface.co/openai/gpt-oss-20b) by OpenAI
- [MetricX](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6-bfloat16) by Juraj Juraska et al.
- [Duplodocus](https://github.com/allenai/duplodocus) by AllenAI
- [fastText](https://github.com/facebookresearch/fastText) by Facebook Research
- [COMET](https://huggingface.co/Unbabel/wmt22-comet-da) by Ricardo Rei et al.
- [sacrebleu](https://github.com/mjpost/sacrebleu) by Matt Post
- Mitsuki Sakamoto for deploying the model with a UI for internal testing
TRAINING.md ADDED
@@ -0,0 +1,121 @@
# Training Details

This document provides detailed information about the training methodology used to develop the CAT-Translate models.
Further details will be available in a technical report.

## Table of Contents

- [Training Data](#training-data)
- [Supervised Fine-Tuning](#supervised-fine-tuning)
- [Reinforcement Learning](#reinforcement-learning)
- [LoRA Configuration](#lora-configuration)

## Training Data

We synthesized parallel corpora from monolingual data using large language models. For generating translations, we used:

- **DeepSeek-V3**: Used for initial prototyping only; not used for the rest of development.
- **gpt-oss-20b**: Generated most of the data, providing sufficiently high quality for many instances.
- **gpt-oss-120b**: Used for domains where gpt-oss-20b was not satisfactory (e.g., scientific abstracts).

### Data Filtering

The synthesized data were filtered by removing instances that matched any of the following criteria:

- Text written mostly in languages other than Japanese and English
- A Japanese-to-English length ratio that is too large or too small
- Duplicated content, detected with MinHash ([Duplodocus](https://github.com/allenai/duplodocus))
- Low quality according to BLEU and/or COMET scores ([comet-qe](https://huggingface.co/Unbabel/wmt22-comet-da))
- Low-quality text identified manually
- Low-quality patterns identified by hand and caught with hand-written rule-based filters
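As an illustration, two of these filters (the length-ratio check and the language check) might look like the sketch below. The thresholds and the character-class heuristic are illustrative assumptions, not the values used in training:

```python
import re

# Hiragana, katakana, and CJK ideographs as a rough proxy for Japanese text.
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def ja_char_ratio(text: str) -> float:
    """Fraction of characters that look Japanese."""
    return len(JA_CHARS.findall(text)) / max(len(text), 1)

def keep_pair(ja: str, en: str,
              min_len_ratio: float = 0.3, max_len_ratio: float = 3.0,
              min_ja_ratio: float = 0.5) -> bool:
    """Illustrative filter: drop pairs whose length ratio is extreme
    or whose 'Japanese' side contains too little Japanese text."""
    ratio = len(ja) / max(len(en), 1)
    if not (min_len_ratio <= ratio <= max_len_ratio):
        return False
    if ja_char_ratio(ja) < min_ja_ratio:
        return False
    return True
```

A production pipeline would apply these cheap checks first, before the more expensive MinHash and COMET-based passes.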

## Supervised Fine-Tuning

We applied a two-stage fine-tuning approach.

### First Stage: Focus on Diversity

The first stage focused on the diversity of prompts. The dataset consisted of:

- Mostly web-crawled data with relatively low-quality translations
- Some portion from targeted domains, including:
  - Scientific abstracts (arXiv and PubMed)
  - Patents (USPTO)
- Most instances were sentence-long, with some paragraph-long instances

**Key Finding**: We found that model performance mostly saturated with this corpus at around 100k instances. This led us to prepare a more challenging, higher quality dataset for the second stage.

### Second Stage: Focus on Quality

The second stage focused on the quality of generated translations. Key characteristics:

- A large portion of data instances generated by **gpt-oss-120b**
- Focus areas:
  - Scientific abstracts (arXiv and PubMed)
  - Patents (USPTO)
  - Underspecified/misspecified text (e.g., input with typos)
- Most instances were paragraph-long to multiple paragraphs long
- Some data from the first-stage corpus was kept to maintain diversity

## Reinforcement Learning

We used the same corpus as the second stage of SFT. The model was trained with **Multi-Objective GRPO** (Ichihara et al. 2025).

### Primary Reward Model: MetricX-24

We chose [MetricX-24](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6-bfloat16) as our primary reward model for the following reasons:

- Open source
- Faster than LLM-as-a-Judge models
- High agreement with human judgments

We also considered using gpt-oss-120b as a judge, which has very high accuracy. However, it requires significantly more computational resources than were available under our constraints.

### MetricX Limitations

Like all reward models, MetricX has several misspecifications that generation models may exploit:

1. **Language-agnostic**: Being multilingual, it can assign high scores regardless of the output language, even when the task requires generating Japanese text.
2. **Format-agnostic**: Syntactic characters such as newlines (`\n`) and markdown syntax (e.g., `*`, `#`) are ignored.
3. **Tolerant of hallucination**: MetricX is relatively tolerant of hallucination as long as the output contains the information in the input text. This is not ideal for training language-model-based machine translation systems.

### Auxiliary Reward Functions

To remedy these problems, we implemented auxiliary reward functions:

#### 1. BLEU Score (Weight: 0.1)

Used to compute lexical overlap with the reference text. Expected to be effective for:
- Avoiding overoptimization against the other reward model
- Rewarding accurate translation of technical terms
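For intuition, a simplified single-reference sentence BLEU with whitespace tokenization can be sketched as below. This is an illustrative toy, not the metric used in training, which was presumably computed with sacrebleu:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    """Simplified BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty (single reference, no smoothing)."""
    hyp_t, ref_t = hyp.split(), ref.split()
    if not hyp_t or not ref_t:
        return 0.0
    orders = range(1, min(max_n, len(hyp_t)) + 1)
    log_prec = 0.0
    for n in orders:
        h, r = ngrams(hyp_t, n), ngrams(ref_t, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        if overlap == 0:
            return 0.0
        log_prec += math.log(overlap / sum(h.values()))
    # Brevity penalty for hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref_t) / len(hyp_t)))
    return bp * math.exp(log_prec / len(orders))
```

Because BLEU rewards exact lexical matches, it complements a learned metric like MetricX, which scores meaning rather than surface form.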

#### 2. Format Consistency

Texts that differ too much in format are penalized. This addresses the issue where models often generate markdown-formatted text even when the input is plain text.

#### 3. Length Penalty

Texts that are too long or too short are penalized. This suppressed many hallucinations generated by the models.

### Reward Normalization Strategy

- **MetricX and BLEU** (weights 1.0 and 0.1): Applied with normalization to compute advantages
  - **Rationale**: Translation quality is difficult to learn, so training with relative (group-normalized) advantages makes sense
- **Format consistency and length penalty**: Applied absolutely, without normalization (as in Dr. GRPO)
  - **Rationale**: These are easy to learn on their own and exist to keep the model's behavior in check. The penalties should be large enough to prevent violations regardless of translation quality improvements, and the model should be able to learn them. Thus, we penalize with a large absolute value rather than a relative one.
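The normalization split above can be sketched as follows. The reward values and penalty magnitudes are hypothetical; only the weights 1.0 (MetricX) and 0.1 (BLEU) come from the text:

```python
import statistics

def grpo_advantages(metricx, bleu, penalties,
                    w_metricx=1.0, w_bleu=0.1):
    """Group-relative advantages: the weighted quality rewards
    (MetricX, BLEU) are normalized within the sampled group, while
    format/length penalties are added as absolute values."""
    quality = [w_metricx * m + w_bleu * b for m, b in zip(metricx, bleu)]
    mean = statistics.mean(quality)
    std = statistics.pstdev(quality) or 1.0  # guard against zero variance
    normalized = [(q - mean) / std for q in quality]
    return [n + p for n, p in zip(normalized, penalties)]

# Four sampled translations for one prompt; the last violates the
# length constraint and receives a large absolute penalty.
adv = grpo_advantages(
    metricx=[0.8, 0.6, 0.7, 0.9],
    bleu=[0.5, 0.3, 0.4, 0.6],
    penalties=[0.0, 0.0, 0.0, -5.0],
)
```

With this split, a sample that violates a constraint ends up far below its group even if its translation quality is the best, which is exactly the "large absolute value" behavior described above.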

## LoRA Configuration

We used LoRA (Low-Rank Adaptation) to reduce computational resource requirements.

| Model Size | LoRA Usage |
|------------|-----------|
| 0.5B | Not used |
| 1B | For GRPO |
| 3B | For the second stage of SFT and GRPO |
| 7B | For all processes |
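The idea behind LoRA is that a frozen weight matrix W is augmented with a trainable low-rank update (alpha/r)·B·A, so only the small factors A and B receive gradients. A toy dependency-free illustration (not the actual training code, which would use a library such as PEFT):

```python
import random

def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = W x + (alpha / r) * B (A x): W stays frozen; only the
    low-rank factors A (r x d_in) and B (d_out x r) are trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

random.seed(0)
d_in, d_out, r = 4, 3, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero: the update is a no-op at init
x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(W, A, B, x, r=r)
```

Because only A and B (2·r·d values per layer instead of d_in·d_out) are updated, memory and compute drop sharply, which is why LoRA was used for the larger models in the table above.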

---

For the main project overview, see [README.md](README.md).
config.json ADDED
@@ -0,0 +1,30 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 80,
  "hidden_act": "silu",
  "hidden_size": 1280,
  "initializer_range": 0.02,
  "intermediate_size": 4480,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 8,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": false,
  "vocab_size": 102400
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3375cdd5cb11dbc5036df4902cf20afbc5ffa560f51bb138bd4ffa8e8c0b10f9
size 1586121792
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<cls>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "<sep>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
test.py ADDED
@@ -0,0 +1,22 @@
from transformers import pipeline

# model_name = "CyberAgent/CAT-Translate-0.8b"
# chat_pipeline = pipeline("text-generation", model_name)

# Load the model from the current directory
chat_pipeline = pipeline("text-generation", model=".")

prompt = "Translate the following {src_lang} text into {tgt_lang}.\n\n{src_text}"

src_lang = "Japanese"
tgt_lang = "English"
src_text = "🐈はとてもかわいいの。おててがまるくてふわふわなの。"

user_input = [{"role": "user", "content": prompt.format(src_lang=src_lang, tgt_lang=tgt_lang, src_text=src_text)}]

response = chat_pipeline(user_input)

print("-" * 20)
print("Source Text:")
print(src_text)
print("Translation:")
print(response[0]['generated_text'][-1]['content'])
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:008293028e1a9d9a1038d9b63d989a2319797dfeaa03f171093a57b33a3a8277
size 1831879
tokenizer_config.json ADDED
@@ -0,0 +1,172 @@
{
  "add_bos_token": false,
  "add_dummy_prefix_space": false,
  "add_eos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<mask>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<cls>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<|system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "8": {
      "content": "<|assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "9": {
      "content": "<|user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "10": {
      "content": "<|available_tools|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "11": {
      "content": "<|tool_calls|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "12": {
      "content": "<|tool_results|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "13": {
      "content": "<|code|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "14": {
      "content": "<|file|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "102397": {
      "content": "<|prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "102398": {
      "content": "<|suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "102399": {
      "content": "<|middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "bos_token": "<s>",
  "chat_template": "\n{%- set user_messages = messages | selectattr('role', 'equalto', 'user') | list %}\n{%- macro output_available_tools(tools, message) %}\n{%- if tools and (message == user_messages[-1]) %}\n {{- '<|available_tools|>[' }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- \"{\" }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- \"'\" + key + \"': '\" + val + \"'\" }}\n {%- else %}\n {{- \"'\" + key + \"': \" + val|string }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- eos_token -}}\n{%- endif %}\n{%- endmacro %}\n\n{%- macro output_tool_results(tool_results) %}\n{{- '<|tool_results|>[' }}\n{%- for tool_result in tool_results %}\n {{- \"{'content': \" + tool_result.content|string + \", 'call_id': '\" + tool_result.call_id + \"'}\" }}\n{%- endfor %}\n{{- ']' }}\n{{- eos_token -}}\n{%- endmacro %}\n\n{%- macro output_tool_calls(tool_calls) %}\n{{- '<|tool_calls|>[' }}\n{%- for tool_call in tool_calls %}\n {{- \"{'id': '\" + tool_call.id + \"', 'name': '\" + tool_call.name + \"', 'arguments': \" + tool_call.arguments|string + '}' }}\n{%- endfor %}\n{{- ']' }}\n{%- endmacro %}\n\n{%- for message in messages %}\n {%- if message['role'] == 'user' %}\n {%- if tools is defined %}\n {{- output_available_tools(tools, message) }}\n {%- endif %}\n {{- '<|user|>' + message['content'] + eos_token -}}\n {%- elif message['role'] == 'system' %}\n {{- '<|system|>' + message['content'] + eos_token -}}\n {%- elif message['role'] == 'assistant' %}\n {% set assistant_content = \"\" %}\n {%- if message.content is defined %}\n {% set assistant_content = message.content %}\n {%- endif %}\n {%- if message.tool_calls is defined and message.tool_calls -%}\n {{- '<|assistant|>' + assistant_content + output_tool_calls(message['tool_calls']) + eos_token -}}\n {%- else %}\n {{- '<|assistant|>' + assistant_content + eos_token }}\n {%- endif %}\n {%- elif message['role'] == 'tool_results' %}\n {{- output_tool_results(message.tool_results) }}\n {%- endif %}\n{%- if loop.last and add_generation_prompt -%}\n {{- '<|assistant|>' -}}\n{%- endif -%}\n{%- endfor %}\n",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<cls>",
  "do_lower_case": false,
  "eos_token": "</s>",
  "extra_ids": 0,
  "extra_special_tokens": {},
  "keep_accents": true,
  "legacy": false,
  "mask_token": "<mask>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "padding_side": "left",
  "sep_token": "<sep>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}