	---
	language: en
	thumbnail: https://bagdeabhishek.github.io/twitterAnalysis_files/networkfin.jpg
	tags:
	- India
	- politics
	- tweets
	- BJP
	- Congress
	- AAP
	- pytorch
	- gpt2
	- lm-head
	- text-generation
	license: apache-2.0
	datasets:
	- Twitter
	- IndianPolitics
	---

	# Model name
	Indian Political Tweets LM

	## Model description
	Note: This model is based on GPT2. If you want a bigger model based on GPT2-medium and fine-tuned on the same data, take a look at the [IndianPoliticalTweetsLMMedium](https://huggingface.co/bagdaebhishek/IndianPoliticalTweetsLMMedium) model.

	This is a GPT2 language model with an LM head, fine-tuned on tweets crawled from handles that predominantly belong to Indian politics. For more information about the crawled data, see this [blog](https://bagdeabhishek.github.io/twitterAnalysis) post.

	## Intended uses & limitations
	This fine-tuned model can be used to generate tweets related to Indian politics.
	#### How to use

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

	# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
	# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead class.
	tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
	model = AutoModelForCausalLM.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")

	text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

	# Prompt that the model continues.
	init_sentence = "India will always be"

	print(text_generator(init_sentence))

	```
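
The `text-generation` pipeline returns a list of dictionaries, each with a `generated_text` key, rather than a plain string. Here is a minimal sketch of unpacking that result. `extract_texts` is a hypothetical helper, and `sample_output` is a made-up stand-in for real model output so the snippet runs without downloading the model.

```python
def extract_texts(outputs):
    """Collect the generated strings from a text-generation pipeline result."""
    return [item["generated_text"].strip() for item in outputs]


# Made-up stand-in for text_generator(init_sentence); real output will differ.
sample_output = [{"generated_text": "India will always be a democracy. "}]

texts = extract_texts(sample_output)
print(texts[0])
```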

	#### Limitations and bias
	1. The tweets used to train the model were not manually labelled, so the generated text may not always be in English. I cleaned the data to remove non-English tweets, but the model may still generate "Hinglish" text, so no assumptions should be made about the language of the generated text.
	2. I took care to remove tweets from Twitter handles that are not very influential, but since the data was not curated by hand, the output may contain artefacts such as "-sent via NamoApp".
	3. Like any language model trained on real-world data, this model exhibits biases that are unfortunately part of the political discourse on Twitter. Please keep this in mind when using the output of this model.
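
If trailing artefacts such as "-sent via NamoApp" are a problem downstream, they can be stripped from generated text after the fact. The sketch below is one possible approach and is not part of the model or its training pipeline. The regular expression and the `strip_signature` helper are assumptions, covering only this one signature at the end of a string.

```python
import re

# Assumed pattern: an optional dash, then "sent via NamoApp" at the end of the text.
SIGNATURE = re.compile(r"\s*-?\s*sent via NamoApp\s*$", re.IGNORECASE)


def strip_signature(text):
    """Remove a trailing 'sent via NamoApp' signature, if present."""
    return SIGNATURE.sub("", text).strip()


print(strip_signature("Great rally today -sent via NamoApp"))
```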

	## Training data
	I used the pre-trained gpt2 model from the Hugging Face transformers repository and fine-tuned it on a custom dataset crawled from Twitter. The method used to identify the political handles is described in detail in a [blog](https://bagdeabhishek.github.io/twitterAnalysis) post. I used tweets from both the Pro-BJP and Anti-BJP clusters mentioned in the blog.

	## Training procedure

	For pre-processing, I removed tweets from handles that are not very influential in their cluster. To do this, I calculated eigenvector centrality on the Twitter graph and pruned handles whose centrality fell below a threshold. The threshold was set manually after experimenting with different values.
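
To illustrate the idea of centrality-based pruning, here is a minimal sketch using power iteration in plain Python. It is not the code used to build the training set. The toy graph, the handle names, the choice of an undirected graph, and the threshold of 0.2 are all assumptions made for the example.

```python
import math


def eigenvector_centrality(adj, iterations=200, tol=1e-9):
    """Power iteration on an undirected graph given as {node: set(neighbours)}."""
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # Adding the node's own score iterates on (A + I), which has the same
        # eigenvectors as A but avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[nb] for nb in adj[n]) for n in adj}
        norm = math.sqrt(sum(v * v for v in new.values())) or 1.0
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in adj) < tol
        score = new
        if converged:
            break
    return score


def prune_handles(adj, threshold):
    """Keep only the handles whose centrality is at or above the threshold."""
    scores = eigenvector_centrality(adj)
    return {node for node, value in scores.items() if value >= threshold}


# Hypothetical toy graph: "fringe" is connected to the rest through "c" only.
graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub", "fringe"},
    "fringe": {"c"},
}

print(sorted(prune_handles(graph, threshold=0.2)))
```

On a graph of realistic size, a graph library would be a more practical choice than this hand-rolled loop.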

	I then separated the tweets from the remaining handles by language and trained the LM on the English tweets from both clusters.
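
The exact language-detection method is not specified here. As a rough sketch of one possible heuristic, the snippet below treats a tweet as English-script when most of its letters are ASCII. The `looks_english` helper and the 0.9 ratio are assumptions. A script-based check like this cannot tell English from romanised Hindi, which is one way "Hinglish" text can slip through a language filter.

```python
def looks_english(text, min_ascii_ratio=0.9):
    """Heuristic: True when most alphabetic characters are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # no letters at all, e.g. only emoji or links stripped out
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= min_ascii_ratio


tweets = [
    "India will always be strong",
    "भारत हमेशा मजबूत रहेगा",
    "Bharat hamesha majboot rahega",  # romanised Hindi passes the check
]
english_tweets = [t for t in tweets if looks_english(t)]
print(english_tweets)
```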

	### Hardware
	1. GPU: GTX 1080Ti
	2. CPU: Ryzen 3900x
	3. RAM: 32GB

	This model took roughly 36 hours to fine-tune.