---
language:
- tr
- otk
tags:
- gokturk
- text-generation
license: mit
---

# Bitig-Nano

Bitig-Nano is a small language model that generates text in the Göktürk (Old Turkic) script. It was trained on the Turkish Wikipedia dataset, converted into Göktürk letters.

> [!IMPORTANT]
> **Disclaimer:** This project is for **fun and hobby purposes only**. It is not a professional tool. The model may make mistakes or produce text that is not historically accurate. It is a "Nano"-sized model created for educational experiments.

## How to Use

You can use this model with the Python `transformers` library.

```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

model_name = "eokayakca/Bitig-Nano"

tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

prompt = "𐱅𐰇𐰼"  # Start with "Tür"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, max_length=50)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# The output is in logical order (left-to-right).
# For correct display, you may need to reverse it to right-to-left.
print(f"Logical (LTR): {generated_text}")
print(f"Visual (RTL): {generated_text[::-1]}")
```

## About the Data

The model learned from Turkish Wikipedia articles. The Latin-script text was converted to Göktürk letters with a custom converter script.
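
The converter script itself is not shown here. As a rough illustration only, a character-mapping approach might look like the sketch below; the three letters are taken from the "Tür" example above, and everything else is an assumption (the real Old Turkic alphabet also distinguishes consonant forms by vowel harmony, which a table this simple cannot capture).

```python
# Illustrative sketch, not the actual converter script.
# Mapping covers only the three letters of "Tür" from the usage example;
# real Old Turkic has vowel-harmony-dependent consonant forms.
GOKTURK_MAP = {
    "t": "𐱅",
    "ü": "𐰇",
    "r": "𐰼",
}

def to_gokturk(text: str) -> str:
    # Map each known letter; pass unknown characters through unchanged.
    return "".join(GOKTURK_MAP.get(ch, ch) for ch in text.lower())

print(to_gokturk("Tür"))  # → 𐱅𐰇𐰼
```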

**Technical Note:** The text is stored in **logical order (left-to-right)** for Unicode compatibility. However, the Göktürk script is historically written and read from **right-to-left**, so you may need to reverse the output for visual display.
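
Note that for multi-line output, a plain `[::-1]` reversal would also flip the line order. A minimal helper (an assumption for illustration, not part of this repository) can reverse each line independently; this naive per-character reversal is adequate here because the Old Turkic block has no combining marks:

```python
def logical_to_visual(text: str) -> str:
    # Reverse characters within each line, keeping line order intact,
    # turning logical (LTR) storage order into visual (RTL) display order.
    return "\n".join(line[::-1] for line in text.split("\n"))

print(logical_to_visual("𐱅𐰇𐰼"))  # single line: same as simple reversal
```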

## Training Details

- **Hardware:** Apple M1 Mac (16 GB RAM)
- **Training Time:** ~20 hours
- **Epochs:** 3
- **Dataset:** [eokayakca/turkish-wikipedia-gokturk](https://huggingface.co/datasets/eokayakca/turkish-wikipedia-gokturk)

## Limitations

- The model is very small ("Nano"-sized).
- It may generate nonsense words or grammatically incorrect sentences.
- It is designed for testing and learning, not for serious translation or historical research.