---
language: "en"
thumbnail: "https://bagdeabhishek.github.io/twitterAnalysis_files/networkfin.jpg"
tags:
- India
- politics
- tweets
- BJP
- Congress
- AAP
- pytorch
- gpt2
- lm-head
- text-generation
license: "apache-2.0"
datasets:
- Twitter
- IndianPolitics
---

# Indian Political Tweets LM

## Model description

This is a GPT-2 language model with an LM head, fine-tuned on tweets crawled from handles that belong predominantly to Indian politics. For more information about the crawled data, see this [blog](https://bagdeabhishek.github.io/twitterAnalysis) post.

## Intended uses & limitations

This fine-tuned model can be used to generate tweets related to Indian politics.

#### How to use

```python
# Note: on newer transformers versions, AutoModelForCausalLM is the
# non-deprecated equivalent of AutoModelWithLMHead.
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline

tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
model = AutoModelWithLMHead.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")

text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

init_sentence = "India will always be"

print(text_generator(init_sentence))
```

#### Limitations and bias

1. The tweets used to train the model were not manually labelled, so the generated text may not always be in English. I cleaned the data to remove non-English tweets, but the model may still generate "Hinglish" text, so no assumptions should be made about the language of the output.
2. I took care to remove tweets from Twitter handles that are not very influential, but since the data was not curated by hand there may be some artefacts like "-sent via NamoApp" etc.
3. Like any language model trained on real-world data, this model also exhibits some biases which unfortunately are a part of the political discourse on Twitter. Please keep this in mind while using the output of this model.

## Training data

I used the pre-trained gpt2 model from the Hugging Face transformers repository and fine-tuned it on a custom dataset crawled from Twitter. The method used to identify the political handles is described in detail in a [blog](https://bagdeabhishek.github.io/twitterAnalysis) post. I used tweets from both the Pro-BJP and Anti-BJP clusters mentioned in the blog.

## Training procedure

For pre-processing, I removed tweets from handles that are not very influential in their cluster. I did this by calculating the eigenvector centrality of each handle on the Twitter graph and pruning handles whose centrality fell below a certain threshold. This threshold was set manually after experimenting with different values.
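As a rough illustration of this pruning step, here is a minimal sketch that computes eigenvector centrality by power iteration on a toy follower graph. The node names and the threshold value are hypothetical; the actual work used the real crawled Twitter graph and a hand-tuned threshold.

```python
import math

def eigenvector_centrality(adj, iters=200, tol=1e-10):
    """Power iteration on an undirected graph given as {node: set(neighbours)}."""
    x = {n: 1.0 for n in adj}
    for _ in range(iters):
        # Each node's new score is the sum of its neighbours' scores.
        x_new = {n: sum(x[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(v * v for v in x_new.values())) or 1.0
        x_new = {n: v / norm for n, v in x_new.items()}
        if max(abs(x_new[n] - x[n]) for n in adj) < tol:
            return x_new
        x = x_new
    return x

# Toy follower graph: one well-connected handle, two mid-tier handles,
# and two fringe handles that only connect to the hub.
graph = {
    "hub":     {"mid1", "mid2", "fringe1", "fringe2"},
    "mid1":    {"hub", "mid2"},
    "mid2":    {"hub", "mid1"},
    "fringe1": {"hub"},
    "fringe2": {"hub"},
}

centrality = eigenvector_centrality(graph)
THRESHOLD = 0.3  # hypothetical; the real value was tuned by hand
kept = {n for n, c in centrality.items() if c >= THRESHOLD}
print(sorted(kept))  # the two fringe handles fall below the threshold
```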

I then separated these handles' tweets by language and trained the LM on the English tweets from both clusters.
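The language-separation idea can be illustrated with a crude character-range heuristic. This is only a toy sketch with a hypothetical `is_probably_english` helper; it flags tweets whose alphabetic characters are mostly Devanagari, which is far less robust than the method actually used on the crawled data.

```python
def is_probably_english(tweet, max_devanagari_ratio=0.1):
    """Toy heuristic: treat a tweet as non-English if too many of its
    alphabetic characters fall in the Devanagari Unicode block."""
    letters = [ch for ch in tweet if ch.isalpha()]
    if not letters:
        return False
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return devanagari / len(letters) <= max_devanagari_ratio

tweets = [
    "India will always be a great nation",
    "भारत हमेशा महान रहेगा",
]
english_tweets = [t for t in tweets if is_probably_english(t)]
print(english_tweets)  # keeps only the first tweet
```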

### Hardware

1. GPU: GTX 1080 Ti
2. CPU: AMD Ryzen 3900X
3. RAM: 32 GB

This model took roughly 36 hours to fine-tune.