tsaditya/GPT-Kalki

tsaditya committed on Oct 3, 2022

Commit 1a860cb · 1 Parent(s): 56794a5

Create README.md

Files changed (1)

README.md +47 -0

README.md ADDED

	@@ -0,0 +1,47 @@

+---
+language: ta
+datasets:
+- oscar
+- IndicNLP
+- Wiki-Tamil novels scraped data
+widget:
+- text: 'ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார்.'
+- text: 'நந்தினி பெரிய பழுவேட்டரையரை உண்மையாக நேசித்தால் '
+- text: 'மதுராந்தகருக்கு இராஜ்யமாளும் விருப்பம் இருப்பதாக இல்லை '
+---
+# GPT2-Kalki
+## Model description
+GPT2-Kalki is a GPT-2 transformer model fine-tuned on a corpus of Tamil-language text from Wikipedia. It has been fine-tuned specifically on the works of [Kalki Krishnamurthy](https://en.wikipedia.org/wiki/Kalki_Krishnamurthy), a Tamil writer from the 1900s.
+The model is an experiment in "what if" scenarios involving the characters of his novels. The recently released film [Ponniyin Selvan - I](https://en.wikipedia.org/wiki/Ponniyin_Selvan:_I) is based on a novel by the same author.
+Training started from [GPT2-Tamil](https://huggingface.co/abinayam/gpt-2-tamil), a GPT-2 model already trained on a Tamil dataset.
+## Dataset Used:
+The GPT-2 model is trained on the [oscar dataset - ta](https://huggingface.co/datasets/oscar), the [IndicNLP dataset - ta](https://indicnlp.ai4bharat.org/corpora/), and a manually scraped Wikipedia dataset consisting specifically of stories and novels.
+The scraped dataset will be released soon.
+## Usage
+You can use this model for Tamil text generation:
+```python
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM
+>>> tokenizer = AutoTokenizer.from_pretrained('tsaditya/GPT-Kalki')
+>>> model = AutoModelForCausalLM.from_pretrained('tsaditya/GPT-Kalki')
+>>> text = "ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார். "
+>>> # Encode as PyTorch tensors, since the model loaded above is a PyTorch model
+>>> encoded_text = tokenizer.encode(text, return_tensors='pt')
+>>> # Sample one continuation using top-k and nucleus (top-p) sampling
+>>> sample_output = model.generate(
+...     encoded_text,
+...     do_sample=True,
+...     max_length=512,
+...     top_k=50,
+...     top_p=0.95,
+...     num_return_sequences=1,
+...     no_repeat_ngram_size=3,
+...     temperature=0.7,
+... )
+>>> print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
+```
+---