Create README.md
README.md
ADDED
@@ -0,0 +1,47 @@
---
language: ta
datasets:
- oscar
- IndicNLP
- Wiki-Tamil novels scraped data

widget:
- text: 'ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார்.'

- text: 'நந்தினி பெரிய பழுவேட்டரையரை உண்மையாக நேசித்தால் '

- text: 'மதுராந்தகருக்கு இராஜ்யமாளும் விருப்பம் இருப்பதாக இல்லை '

---

# GPT2-Kalki
## Model description
GPT2-Kalki is a GPT-2 transformer model fine-tuned on a corpus of Tamil-language data from Wikipedia. It has been specifically fine-tuned on the works of [Kalki Krishnamurthy](https://en.wikipedia.org/wiki/Kalki_Krishnamurthy), a 20th-century Tamil writer.
This model is an experiment in generating "what if" scenarios with the characters of his novels. The recently released film [Ponniyin Selvan - I](https://en.wikipedia.org/wiki/Ponniyin_Selvan:_I) is based on a novel by the same author.
Training started from [GPT2-Tamil](https://huggingface.co/abinayam/gpt-2-tamil), a model already trained on a Tamil dataset.

## Dataset used
The GPT-2 model is trained on the [OSCAR dataset - ta](https://huggingface.co/datasets/oscar), the [IndicNLP dataset - ta](https://indicnlp.ai4bharat.org/corpora/), and a manually scraped Wikipedia dataset made up specifically of stories and novels.
The scraped dataset will be released soon.

## Usage
You can use this model for Tamil text generation:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained('tsaditya/GPT-Kalki')
>>> model = AutoModelForCausalLM.from_pretrained('tsaditya/GPT-Kalki')
>>> text = "ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார். "
>>> # Return PyTorch tensors to match the PyTorch model class loaded above
>>> # (assumes the repository provides PyTorch weights)
>>> encoded_text = tokenizer.encode(text, return_tensors='pt')
>>> # Sample one continuation with top-k and nucleus (top-p) sampling
>>> sample_output = model.generate(
...     encoded_text,
...     do_sample=True,
...     max_length=512,
...     top_k=50,
...     top_p=0.95,
...     num_return_sequences=1,
...     no_repeat_ngram_size=3,
...     temperature=0.7,
... )
>>> print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```
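The sampling settings above can also be passed to the `pipeline` helper from `transformers`, which is convenient for trying several "what if" prompts in a row. The following is a minimal sketch rather than the author's own workflow: `MODEL_ID` repeats the model id from the snippet above, `sampling_kwargs` and `generate_whatif` are hypothetical helper names introduced here, and a PyTorch install is assumed. `transformers` is imported inside the function so the settings helper stays usable on its own; calling `generate_whatif(prompt, 3)` should return three sampled continuations as strings.

```python
from typing import Any, Dict, List

# Model id as used in the snippet above.
MODEL_ID = "tsaditya/GPT-Kalki"


def sampling_kwargs(num_return_sequences: int = 1) -> Dict[str, Any]:
    """Return the sampling settings from the snippet above as keyword arguments."""
    return {
        "do_sample": True,
        "max_length": 512,
        "top_k": 50,
        "top_p": 0.95,
        "temperature": 0.7,
        "no_repeat_ngram_size": 3,
        "num_return_sequences": num_return_sequences,
    }


def generate_whatif(prompt: str, num_return_sequences: int = 1) -> List[str]:
    """Generate continuations of a Tamil prompt (hypothetical helper)."""
    # Imported lazily so sampling_kwargs() works without transformers installed.
    from transformers import pipeline

    # Downloads the model and tokenizer on first use.
    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(prompt, **sampling_kwargs(num_return_sequences))
    # Each result is a dict; "generated_text" holds the prompt plus its continuation.
    return [out["generated_text"] for out in outputs]
```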
---