---
base_model:
- unsloth/gemma-2-9b-bnb-4bit
- google/gemma-2-9b
tags:
- text-generation-inference
- transformers
- unsloth
- gemma2
- trl
license: gemma
language:
- fa
- en
---

# Tuti 🦜

This is [Gemma 2 9B](https://huggingface.co/google/gemma-2-9b), fine-tuned using Unsloth's 4-bit quantization and LoRA (QLoRA) on Persian literature datasets that I curated, created, or found.

## Use cases and datasets

### Word IPA Detection

I fine-tuned this model with QLoRA and uploaded only the LoRA adapter, so it can be used like this:

```python
# pip install unsloth
from unsloth import FastLanguageModel
from transformers import TextStreamer

model_name = "cnababaie/tuti"
max_seq_length = 4096  # Adjust as needed
dtype = None  # None lets Unsloth pick the dtype automatically
load_in_4bit = True  # Load the base model with 4-bit quantization

# Load the base model together with the LoRA adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Switch Unsloth to inference mode

# Alpaca-style prompt: instruction, input, and response slots
alpaca_prompt_template = """### Instruction:
{}

### Input:
{}

### Response:
{}"""
```

```python
inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "IPA این کلمه چیست؟", # instruction
        "جوینده",
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
```

This outputs the correct IPA, */d͡ʒuːjænde/* (*juyande*).
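
By default `TextStreamer` prints the prompt along with the answer, and the same is true if you decode the generated ids yourself. The sketch below shows one way to keep only the text after the `### Response:` marker. The helper name `extract_response` is hypothetical, and the `<bos>`/`<eos>` strings are assumed to be the special tokens the Gemma tokenizer emits; adjust them if your decoded text looks different.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, special_tokens=("<bos>", "<eos>")) -> str:
    """Return only the text that follows the last '### Response:' marker.

    `special_tokens` is an assumption about what the tokenizer leaves in the
    decoded string; pass a different tuple if needed.
    """
    # rpartition returns the whole string as the tail when the marker is absent.
    _, _, answer = decoded.rpartition(RESPONSE_MARKER)
    for token in special_tokens:
        answer = answer.replace(token, "")
    return answer.strip()


# Example with a decoded string shaped like the prompt template above.
decoded = (
    "<bos>### Instruction:\nIPA این کلمه چیست؟\n\n"
    "### Input:\nجوینده\n\n"
    "### Response:\n/d͡ʒuːjænde/ (juyande)<eos>"
)
print(extract_response(decoded))
```

If you only want to hide the prompt while streaming, passing `skip_prompt=True` to `TextStreamer` is a simpler option.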

#### IPA Sources

- [IPA-dict](https://github.com/open-dict-data/ipa-dict/tree/master): Monolingual wordlists with pronunciation information in IPA
- [Wiktionary](https://en.wiktionary.org): The Persian edition doesn't contain IPA, but the English edition (which also covers many words and phrases in languages other than English) has many Persian words with their IPA

### Persian Text Romanization

```python
inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "این متن چه تلفظی داره؟", # instruction
        "خاک به خاطر بارش زیاد باران گل شد.",
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
```

This outputs the romanized pronunciation: *"Xāk be xāter-e bāreš-e ziyād-e bārān gel šod."*

#### Romanization Sources

- [http://alefbaye2om.org/](http://alefbaye2om.org/): Contains PDFs of romanized Persian text

### Persian Poem Translation

```python
inputs = tokenizer(
[
    alpaca_prompt_template.format(
        "ترجمه", # instruction
        "برخیز بتا بیا ز بهر دل ما\r\nحل کن به جمال خویشتن مشکل ما\r\nیک کوزه شراب تا به هم نوش کن\r\nزآن پیش که کوزهها کنند از گل ما",
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
```

This outputs a rhymed English translation that keeps the content of the original poem:

*"Arise, O idol, for our heart's sake,
Solve our troubles with your beauty's make.
One pot of wine, let's drink it all,
Before they make pots from our clay's fall."*
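
The three examples above differ only in the instruction string and the input text, so the prompt construction can be factored into a small helper. This is a minimal sketch: the task keys (`ipa`, `romanize`, `translate`) and the function name `build_prompt` are hypothetical, while the instruction strings and the template are copied from the examples above. Whether other instruction phrasings work equally well is not something this card states.

```python
ALPACA_PROMPT_TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Instruction strings taken from the examples in this card.
# The dictionary keys are illustrative names, not part of the model.
TASK_INSTRUCTIONS = {
    "ipa": "IPA این کلمه چیست؟",
    "romanize": "این متن چه تلفظی داره؟",
    "translate": "ترجمه",
}


def build_prompt(task: str, text: str) -> str:
    """Fill the Alpaca template for one task, leaving the response slot empty."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(TASK_INSTRUCTIONS)}")
    return ALPACA_PROMPT_TEMPLATE.format(TASK_INSTRUCTIONS[task], text, "")


prompt = build_prompt("romanize", "خاک به خاطر بارش زیاد باران گل شد.")
print(prompt)
```

The resulting string can be passed to `tokenizer([prompt], return_tensors = "pt")` in place of the inline `alpaca_prompt_template.format(...)` call.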

#### Poem Translation Sources

- A list of random poems from Ganjoor that I paired with their translations