tsaditya committed on
Commit
1a860cb
·
1 Parent(s): 56794a5

Create README.md

Files changed (1)
  1. README.md +47 -0
README.md ADDED
@@ -0,0 +1,47 @@
1
+ ---
2
+ language: ta
3
+ datasets:
4
+ - oscar
5
+ - IndicNLP
6
+ - Wiki-Tamil novels scraped data
7
+
8
+ widget:
9
+ - text: 'ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார்.'
10
+
11
+ - text: 'நந்தினி பெரிய பழுவேட்டரையரை உண்மையாக நேசித்தால் '
12
+
13
+ - text: 'மதுராந்தகருக்கு இராஜ்யமாளும் விருப்பம் இருப்பதாக இல்லை '
14
+
15
+ ---
16
+
17
+ # GPT2-Kalki
18
+ ## Model description
19
+ GPT2-Kalki is a GPT-2 transformer model fine-tuned on a corpus of Tamil-language data from Wikipedia. It has been fine-tuned specifically on the works of [Kalki Krishnamurthy](https://en.wikipedia.org/wiki/Kalki_Krishnamurthy), a 20th-century Tamil writer.
20
+ This model is an experiment in "what if" scenarios using the characters of his novels. The recently released film [Ponniyin Selvan - I](https://en.wikipedia.org/wiki/Ponniyin_Selvan:_I) is based on a novel by the same author.
21
+ This model starts from [GPT2-Tamil](https://huggingface.co/abinayam/gpt-2-tamil), a GPT-2 model already trained on a Tamil dataset, and is trained further from there.
22
+
23
+ ## Datasets used
24
+ The GPT-2 model is trained on the Tamil (`ta`) portions of the [OSCAR dataset](https://huggingface.co/datasets/oscar) and the [IndicNLP dataset](https://indicnlp.ai4bharat.org/corpora/), plus a manually scraped Wikipedia dataset made up of stories and novels.
25
+ The scraped dataset will be released soon.
26
+
27
+ ## Usage
28
+ You can use this model for Tamil text generation:
29
+ ```python
30
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
31
+ >>> tokenizer = AutoTokenizer.from_pretrained('tsaditya/GPT-Kalki')
32
+ >>> model = AutoModelForCausalLM.from_pretrained('tsaditya/GPT-Kalki')
33
+ >>> text = "ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார். "
34
+ >>> encoded_text = tokenizer.encode(text, return_tensors='pt')  # PyTorch tensors, to match the PyTorch model class
35
+ >>> beam_output = model.generate(
36
+ encoded_text,
37
+ do_sample=True,
38
+ max_length=512,
39
+ top_k=50,
40
+ top_p=0.95,
41
+ num_return_sequences=1,
42
+ no_repeat_ngram_size=3,
43
+ temperature=0.7
44
+ )
45
+ >>> print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
46
+ ```
47
+ ---
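
The `generate` call in the Usage section combines `temperature`, `top_k`, and `top_p` with `do_sample=True`. As a rough illustration of what those three settings do to a single next-token choice, here is a minimal pure-Python sketch. It is not the `transformers` implementation, and the function names `candidate_tokens` and `sample_token` are hypothetical helpers introduced only for this example; the library's own filtering may differ in details such as tie handling and threshold boundaries.

```python
import math
import random


def candidate_tokens(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return (token_index, probability) pairs that survive the filters.

    Simplified illustration only; assumes `logits` is a non-empty list of floats.
    """
    # Temperature below 1.0 sharpens the distribution toward likely tokens.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token indices, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Numerically stable softmax over the tokens that survived top-k.
    peak = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(exps)

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, value in zip(ranked, exps):
        prob = value / total
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits, rng=None, **filters):
    """Draw one token index at random from the filtered candidates."""
    rng = rng or random.Random()
    kept = candidate_tokens(logits, **filters)
    indices = [index for index, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
    print(candidate_tokens(toy_logits, temperature=1.0))
    print(sample_token(toy_logits, rng=random.Random(0)))
```

Lowering `top_p` or `top_k` shrinks the candidate set and makes output more predictable, while raising `temperature` flattens the probabilities so that rarer tokens are picked more often.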