* **base model** : mistralai/Mistral-7B-v0.1
* **dataset** : translation_v3_346k
## usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils.simple_bleu import simple_score
import torch

repo = "davidkim205/iris-7b"
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(repo)


def generate(prompt):
    encoding = tokenizer(
        prompt,
        return_tensors='pt',
        return_token_type_ids=False
    ).to("cuda")
    gen_tokens = model.generate(
        **encoding,
        max_new_tokens=2048,
        temperature=1.0,
        num_beams=5,
    )
    # Decode only the newly generated tokens, not the prompt.
    prompt_end_size = encoding.input_ids.shape[1]
    result = tokenizer.decode(gen_tokens[0, prompt_end_size:])

    return result


def translate_ko2en(text):
    # "Translate the following sentence into English."
    prompt = f"[INST] 다음 문장을 영어로 번역하세요.{text} [/INST]"
    return generate(prompt)


def translate_en2ko(text):
    # "Translate the following sentence into Korean."
    prompt = f"[INST] 다음 문장을 한글로 번역하세요.{text} [/INST]"
    return generate(prompt)


def main():
    while True:
        text = input('>')
        en_text = translate_ko2en(text)
        ko_text = translate_en2ko(en_text)
        print('en_text', en_text)
        print('ko_text', ko_text)
        print('score', simple_score(text, ko_text))


if __name__ == "__main__":
    main()
```
output
```
$ python iris_test.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.05it/s]
>Iris is a model for Korean-English sentence translation based on deep learning.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
en_text Iris is a model for Korean-English sentence translation based on deep learning.</s>
ko_text 아이리스는 딥러닝을 기반으로 한 한-영어 문장 번역을 위한 모델이다.</s>
```
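The trailing `</s>` in `en_text` and `ko_text` is the model's end-of-sequence token; `tokenizer.decode` keeps it unless `skip_special_tokens=True` is passed. A minimal sketch of trimming it as a plain string instead (the `strip_eos` helper is hypothetical, not part of this repository):

```python
# Hypothetical helper: trims a trailing EOS marker from decoded text.
# Equivalent in effect to decoding with skip_special_tokens=True when
# </s> is the only special token in the output.

def strip_eos(text: str, eos: str = "</s>") -> str:
    return text[:-len(eos)] if text.endswith(eos) else text

print(strip_eos("Iris is a model for Korean-English sentence translation based on deep learning.</s>"))
# → Iris is a model for Korean-English sentence translation based on deep learning.
```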
## template
### ko -> en
```
[INST] 다음 문장을 영어로 번역하세요.{text} [/INST]
```
### en -> ko
```
[INST] 다음 문장을 한글로 번역하세요.{text} [/INST]
```

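The two templates can be filled programmatically by substituting `{text}`; a minimal sketch, assuming the prompt formats shown above (the `apply_template` helper and its `direction` keys are hypothetical, not part of this repository):

```python
# Sketch: filling the Iris prompt templates for each translation direction.
# apply_template is a hypothetical helper, not part of the repository.

def apply_template(text: str, direction: str = "ko2en") -> str:
    templates = {
        # "Translate the following sentence into English."
        "ko2en": "[INST] 다음 문장을 영어로 번역하세요.{text} [/INST]",
        # "Translate the following sentence into Korean."
        "en2ko": "[INST] 다음 문장을 한글로 번역하세요.{text} [/INST]",
    }
    return templates[direction].format(text=text)

print(apply_template("안녕하세요", "ko2en"))
# → [INST] 다음 문장을 영어로 번역하세요.안녕하세요 [/INST]
```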
## dataset info : translation_v3_346k

| dataset name | ratio | size |