---
base_model:
- unsloth/gemma-2-9b-bnb-4bit
- google/gemma-2-9b
tags:
- text-generation-inference
- transformers
- unsloth
- gemma2
- trl
license: gemma
language:
- fa
- en
---


# Tuti 🦜

This is [Gemma 2 9B](https://huggingface.co/google/gemma-2-9b), fine-tuned with Unsloth's 4-bit quantization and LoRA (QLoRA) on Persian literature datasets I curated, created, or found.

## Use cases and datasets

### Word IPA Detection

I fine-tuned this model with QLoRA and uploaded only the LoRA adapter, so it can be used like this:

```python
# pip install unsloth
from unsloth import FastLanguageModel
from transformers import TextStreamer

model_name = "cnababaie/tuti"
max_seq_length = 4096  # Adjust as needed
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)
alpaca_prompt_template = """### Instruction:
{}

### Input:
{}

### Response:
{}"""
```
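Since the examples below all reuse the same Alpaca-style template, it may help to wrap it in a small helper; note that the helper name `make_prompt` is my own shorthand, not part of the repository:

```python
# A small convenience wrapper around the Alpaca-style template shown above.
# The helper name `make_prompt` is illustrative, not part of the repository.
ALPACA_PROMPT_TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

def make_prompt(instruction: str, user_input: str) -> str:
    # Leave the Response section empty so the model fills it in at generation time.
    return ALPACA_PROMPT_TEMPLATE.format(instruction, user_input, "")
```

Each use case then only needs its own instruction and input string.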

```python
inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "IPA این کلمه چیست؟", # instruction: "What is the IPA of this word?"
        "جوینده", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
```

This will correctly output the IPA as *"/d͡ʒuːjænde/"* (*juyande*).

#### IPA Sources

- [IPA-dict](https://github.com/open-dict-data/ipa-dict/tree/master): Monolingual wordlists with pronunciation information in IPA
- [Wiktionary](https://en.wiktionary.org): The Persian Wiktionary doesn't include IPA, but the English one (which also covers many non-English words and phrases) lists many Persian words with their IPA

### Persian Text Romanization

```python
inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "این متن چه تلفظی داره؟", # instruction: "How is this text pronounced?"
        "خاک به خاطر بارش زیاد باران گل شد.", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
```

This will output the exact pronunciation: *"Xāk be xāter-e bāreš-e ziyād-e bārān gel šod."*

#### Romanization Sources

- [http://alefbaye2om.org/](http://alefbaye2om.org/): Contains PDFs with romanized Persian text


### Persian Poem Translation

```python
inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "ترجمه", # instruction: "Translate"
        "برخیز بتا بیا ز بهر دل ما\r\nحل کن به جمال خویشتن مشکل ما\r\nیک کوزه شراب تا به هم نوش کن\r\nزآن پیش که کوزه‌ها کنند از گل ما",
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
```

This will output a rhymed translation that preserves the content of the original poem:

*"Arise, O idol, for our heart's sake, 
Solve our troubles with your beauty's make. 
One pot of wine, let's drink it all, 
Before they make pots from our clay's fall."*

#### Poem Translation Sources

- A list of random poems from Ganjoor paired with translations, which I created