---
base_model:
- unsloth/gemma-2-9b-bnb-4bit
- google/gemma-2-9b
tags:
- text-generation-inference
- transformers
- unsloth
- gemma2
- trl
license: gemma
language:
- fa
- en
---
# Tuti 🦜
This is a [Gemma 2 9b](https://huggingface.co/google/gemma-2-9b) model, fine-tuned with Unsloth's 4-bit quantization and LoRA (QLoRA) on Persian literature datasets that I curated, created, or found.
## Use cases and datasets
### Word IPA Detection
I fine-tuned this model with QLoRA and uploaded only the LoRA adapter, so it can be used like this:
```python
# pip install unsloth
from unsloth import FastLanguageModel
from transformers import TextStreamer

model_name = "cnababaie/tuti"
max_seq_length = 4096  # Adjust as needed
dtype = None  # None lets Unsloth auto-detect the dtype
load_in_4bit = True  # Load the weights in 4-bit to reduce memory use

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable Unsloth's inference mode

alpaca_prompt_template = """### Instruction:
{}
### Input:
{}
### Response:
{}"""
```
```python
inputs = tokenizer(
    [
        alpaca_prompt_template.format(
            "IPA این کلمه چیست؟",  # instruction: "What is the IPA of this word?"
            "جوینده",  # input: the word to transcribe
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt",
).to("cuda")

# Stream tokens to stdout as they are generated
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=64)
```
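
All the examples in this README fill the same Alpaca-style template and differ only in the instruction and input. As a minimal sketch, the prompt construction and the clean-up of the decoded output could be factored into two small helpers. The names `build_prompt` and `extract_response` are hypothetical and are not part of Unsloth or this repository.

```python
# Hypothetical helpers for illustration; not part of Unsloth or this repository.
ALPACA_PROMPT_TEMPLATE = """### Instruction:
{}
### Input:
{}
### Response:
{}"""

RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the template, leaving the response slot empty for generation."""
    return ALPACA_PROMPT_TEMPLATE.format(instruction, input_text, "")


def extract_response(decoded: str) -> str:
    """Return the text after the first response marker, stripped of whitespace.

    If the marker is missing, the whole string is returned stripped.
    """
    if RESPONSE_MARKER not in decoded:
        return decoded.strip()
    _, _, tail = decoded.partition(RESPONSE_MARKER)
    return tail.strip()
```

With these, a prompt can be built as `build_prompt("IPA این کلمه چیست؟", "جوینده")`. If you decode the generated ids yourself instead of streaming them, for example with `tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]`, passing the result through `extract_response` should leave only the model's answer.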
This outputs the correct IPA: *"/d͡ʒuːjænde/ (juyande)"*.
#### IPA Sources
- [IPA-dict](https://github.com/open-dict-data/ipa-dict/tree/master): Monolingual wordlists with pronunciation information in IPA
- [Wiktionary](https://en.wiktionary.org): The Persian edition doesn't contain IPA, but the English edition (which covers many words and phrases in languages other than English) has a lot of Persian words with their IPA
### Persian Text Romanization
```python
inputs = tokenizer(
    [
        alpaca_prompt_template.format(
            "این متن چه تلفظی داره؟",  # instruction: "How is this text pronounced?"
            "خاک به خاطر بارش زیاد باران گل شد.",  # input: the sentence to romanize
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt",
).to("cuda")

# Stream tokens to stdout as they are generated
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=64)
```
This outputs the romanized pronunciation: *"Xāk be xāter-e bāreš-e ziyād-e bārān gel šod."*.
#### Romanization Sources
- [http://alefbaye2om.org/](http://alefbaye2om.org/): Contains PDFs of romanized Persian text
### Persian Poem Translation
```python
inputs = tokenizer(
    [
        alpaca_prompt_template.format(
            "ترجمه",  # instruction: "Translation"
            "برخیز بتا بیا ز بهر دل ما\r\nحل کن به جمال خویشتن مشکل ما\r\nیک کوزه شراب تا به هم نوش کن\r\nزآن پیش که کوزه‌ها کنند از گل ما",  # input: the poem
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt",
).to("cuda")

# Stream tokens to stdout as they are generated
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=64)
```
This outputs a rhymed English translation that preserves the content of the original poem:
*"Arise, O idol, for our heart's sake,
Solve our troubles with your beauty's make.
One pot of wine, let's drink it all,
Before they make pots from our clay's fall."*.
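
The poem in the example above is passed as a single string with its verses separated by `\r\n`. If your poem is stored as a list of verses, a small helper like the one below could produce the same layout. `join_verses` is a hypothetical name, and whether the model depends on this exact separator is an assumption based only on the example input.

```python
# Hypothetical helper for illustration; not part of this repository.
def join_verses(verses):
    """Join non-empty verses with CRLF, the separator used in the example input."""
    cleaned = [verse.strip() for verse in verses if verse.strip()]
    return "\r\n".join(cleaned)
```

The result can then be used as the input field of the prompt template.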
#### Poem Translation Sources
- A list I created of random poems from Ganjoor, each paired with its translation