---
language:
- tr
- otk
tags:
- gokturk
- text-generation
license: mit
---
# Bitig-Nano
Bitig-Nano is a small language model that generates text in the Göktürk (Old Turkic) script. It was trained on Turkish Wikipedia text transliterated into Göktürk letters.
> [!IMPORTANT]
> **Disclaimer:** This project is for **fun and hobby purposes only**. It is not a professional tool. The model might make mistakes or write things that are not historically accurate. It is a "Nano" sized model created for educational experiments.
## How to Use
You can use this model with the Python `transformers` library.
```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

model_name = "eokayakca/Bitig-Nano"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

prompt = "𐱅𐰇𐰼"  # Start with "Tür"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=50)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# The output is in logical order (left-to-right).
# For correct display, you may need to reverse it to right-to-left.
print(f"Logical (LTR): {generated_text}")
print(f"Visual (RTL): {generated_text[::-1]}")
```
## About the Data
The model was trained on Turkish Wikipedia articles whose Latin letters were transliterated into Göktürk letters with a custom converter script.
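The converter script itself is not published here, but a minimal sketch gives the idea. The mapping below is hypothetical and covers only the three letters implied by the sample prompt in this README ("Tür" → 𐱅𐰇𐰼); the real converter handles the full alphabet.

```python
# Hypothetical Latin-to-Göktürk transliteration sketch (illustration only).
# The three entries below are inferred from the README's sample prompt,
# where "Tür" is written as 𐱅𐰇𐰼.
LATIN_TO_GOKTURK = {
    "t": "𐱅",
    "ü": "𐰇",
    "r": "𐰼",
}

def to_gokturk(text: str) -> str:
    """Map each known Latin letter to a Göktürk codepoint; keep unknowns as-is."""
    return "".join(LATIN_TO_GOKTURK.get(ch, ch) for ch in text.lower())

print(to_gokturk("Tür"))  # -> 𐱅𐰇𐰼
```

A full converter would also need rules for Old Turkic's front/back vowel-harmony letter variants, which a flat character map cannot express.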
**Technical Note:** The text is stored in **Logical Order (Left-to-Right)** for Unicode compatibility. However, Göktürk script is historically written and read from **Right-to-Left**. When you view the output, you may need to reverse it visually.
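If you want the reversal to preserve line breaks, a small helper can mirror each line independently. This is a minimal sketch, assuming per-line reversal is the desired visual order:

```python
# Mirror each line independently so line breaks stay in place
# while the glyphs within each line read right-to-left.
def to_visual_rtl(text: str) -> str:
    return "\n".join(line[::-1] for line in text.splitlines())

print(to_visual_rtl("𐱅𐰇𐰼"))  # -> 𐰼𐰇𐱅
```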
## Training Details
- **Hardware:** Apple M1 Mac (16 GB RAM)
- **Training Time:** ~20 hours
- **Epochs:** 3
- **Dataset:** [eokayakca/turkish-wikipedia-gokturk](https://huggingface.co/datasets/eokayakca/turkish-wikipedia-gokturk)
## Limitations
- The model is very small (Nano size).
- It may generate nonsense words or grammatically incorrect sentences.
- It is designed for testing and learning, not for serious translation or historical research.