## Viral Tweets: User exploration

In this notebook, we explore the users who posted viral tweets, that is, we analyze viral tweets from the user's point of view. For example, we compare a user's popularity with the popularity of their tweets, look at the history of their tweets, and check whether any tweet features changed noticeably when their tweets went viral.

## 0 - Setup

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from tqdm import tqdm

#pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

DATA_PATH = "../../data"
VIRAL_TWEETS_PATH = f"{DATA_PATH}/viral_users"

In [2]:
from helper.twitter_client_wrapper import TwitterClientWrapper, EXPANSIONS, MEDIA_FIELDS, TWEET_FIELDS, USER_FIELDS

twitter_client_wrapper = TwitterClientWrapper("../../api_key.yaml", wait_on_rate_limit=False)

## 1 - Retrieve the data from disk

### 1.1 Retrieve the viral tweets data

**Note**: You may notice that not all tweets have been retrieved, since some may have been deleted after they were scraped.

**Note 2**: Also keep in mind that the number of users retrieved may be smaller than the number of viral tweets, because a user may have two or more viral tweets in our sample.
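One way to check the second note is to compare the number of rows with the number of distinct `author_id` values. The sketch below uses a small made-up frame (`toy_tweets` and its IDs are invented for illustration), not the real dataset.

```python
import pandas as pd

# Toy stand-in for the viral tweets sample; the IDs are invented.
toy_tweets = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "author_id": ["u1", "u1", "u2", "u3"],
})

n_tweets = len(toy_tweets)
n_authors = toy_tweets["author_id"].nunique()

# Authors that appear more than once explain the gap between the two counts.
tweets_per_author = toy_tweets["author_id"].value_counts()
repeat_authors = tweets_per_author[tweets_per_author > 1]

print(n_tweets, n_authors, repeat_authors.to_dict())
```

The same two calls (`nunique` and `value_counts`) should apply unchanged to `viral_tweets_df.author_id` once it is loaded.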

In [3]:
# dtypes={"id": str, "author_id": str, "has_media": bool, "possibly_sensitive": bool}
dtypes={"id": str, "author_id": str}

In [4]:
# Import tweets first
viral_tweets_df = pd.read_csv(f"{VIRAL_TWEETS_PATH}/all_tweets.csv", dtype=dtypes, escapechar='\\', encoding='utf-8')
# viral_tweets_df = pd.read_csv(f"{VIRAL_TWEETS_PATH}/all_tweets.csv", dtype=dtypes)
viral_tweets_df.head()


Unnamed: 0,created_at,author_id,text,possibly_sensitive,edit_history_tweet_ids,lang,id,mentions,retweet_count,reply_count,like_count,quote_count,context_annotations,urls,has_media,annotations,hashtags,attachments.poll_ids,withheld.copyright,withheld.country_codes,withheld.scope,cashtags,geo.place_id,geo.coordinates.type,geo.coordinates.coordinates
0,2022-10-31T03:21:11.000Z,1047733077898739712,@manjirosx you too jiro🫶🏽,False,['1586921195059834880'],en,1586921195059834880,"[{'start': 0, 'end': 10, 'username': 'manjiros...",0.0,0.0,1.0,0.0,,,False,,,,,,,,,,
1,2022-10-31T03:13:57.000Z,1047733077898739712,@ilyicey u omd,False,['1586919376086704129'],nl,1586919376086704129,"[{'start': 0, 'end': 8, 'username': 'ilyicey',...",0.0,0.0,0.0,0.0,,,False,,,,,,,,,,
2,2022-10-31T03:13:24.000Z,1047733077898739712,@ilyicey i’m fine,False,['1586919239243296768'],en,1586919239243296768,"[{'start': 0, 'end': 8, 'username': 'ilyicey',...",1.0,1.0,2.0,0.0,,,False,,,,,,,,,,
3,2022-10-30T22:49:53.000Z,1047733077898739712,@imVolo_ I’ll unfollow rn,False,['1586852923706732544'],en,1586852923706732544,"[{'start': 0, 'end': 8, 'username': 'imVolo_',...",0.0,0.0,3.0,0.0,,,False,,,,,,,,,,
4,2022-10-30T22:45:33.000Z,1047733077898739712,“what do you want to be for halloween?” his li...,False,['1586851830767591424'],en,1586851830767591424,,611.0,19.0,4132.0,55.0,"[{'domain': {'id': '29', 'name': 'Events [Enti...",,False,,,,,,,,,,


In [5]:
viral_tweets_df[~viral_tweets_df.annotations.isna()].text.iloc[10]

'RT @strbrkrr: apple be like "high volume may damage your ears..." ok… i don’t care'

### 1.2 - Retrieve viral tweets users

We start by retrieving the users behind the viral tweets. Conveniently, users are **included as expansions** when retrieving the tweets. For each user, we retrieve the user's tweet history and profile information.

In [None]:
# Load the user data from disk. It comes from the 'includes' field of the API response, which is returned when expansions are requested
users_df = pd.read_csv(f"{VIRAL_TWEETS_PATH}/users.csv", dtype={"id": str, "pinned_tweet_id": str}, escapechar="\\")
users_df

In [None]:
'''
id                        object
edit_history_tweet_ids    object
author_id                 object
created_at                object
possibly_sensitive          bool
text                      object
retweet_count              int64
reply_count                int64
like_count                 int64
quote_count                int64
has_media                   bool
urls                      object
context_annotations       object
annotations               object
hashtags                  object
geo.place_id              object
mentions                  object
dtype: object
'''
viral_tweets_df.dtypes

## 2 - Analysis of a single user

Let's observe the tweets of a single user who has posted viral tweets. We'll analyze their features to see what changed in the user's tweets over time, and how those changes reflect changes in the user's behaviour.

In [None]:
# Take first user
user_id = users_df.iloc[0].id

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
user_tweets = viral_tweets_df[viral_tweets_df.author_id == user_id].copy()
user_tweets['created_at'] = pd.to_datetime(user_tweets.created_at)
user_tweets.head()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10,5))

ax[0].set_title("Retweet Count vs Tweet Date")
sns.lineplot(user_tweets, x='created_at', y='retweet_count', ax=ax[0])

ax[1].set_title("Like Count vs Tweet Date")
sns.lineplot(user_tweets, x='created_at', y='like_count', ax=ax[1])

plt.tight_layout()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10,5))

user_tweets['tweet_length'] = user_tweets['text'].apply(len)

ax[0].set_title("Retweet Count vs Tweet Length")
sns.lineplot(user_tweets, x='tweet_length', y='retweet_count', ax=ax[0])

ax[1].set_title("Like Count vs Tweet Length")
sns.lineplot(user_tweets, x='tweet_length', y='like_count', ax=ax[1])

plt.tight_layout()

In [None]:
# Has media
sns.jointplot(user_tweets, x='has_media', y='retweet_count')

plt.suptitle("# Retweets vs Tweet has media")
plt.tight_layout()

In [None]:
sns.pairplot(user_tweets[['tweet_length', 'has_media', 'retweet_count', 'like_count']])

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(10,5))

user_tweets['tweet_length'] = user_tweets['text'].apply(len)

ax[0][0].set_title("Retweet Count vs Date")
sns.lineplot(user_tweets, x='created_at', y='retweet_count', ax=ax[0][0])

ax[0][1].set_title("Like Count vs Date")
sns.lineplot(user_tweets, x='created_at', y='like_count', ax=ax[0][1])

ax[1][0].set_title("Has Media vs Date")
sns.scatterplot(user_tweets, x='created_at', y='has_media', ax=ax[1][0])

ax[1][1].set_title("Tweet Length vs Date")
sns.scatterplot(user_tweets, x='created_at', y='tweet_length', ax=ax[1][1])

plt.tight_layout()

In [None]:
### TODO: Analyze the change in tweet features depending on date (one row depending on date, other depending on retweet count to reflect the evolution)
### TODO: Concentration on topics [group by topics for a sample user]

## 3 - Aggregate Analysis of all viral users' tweets

#### 3.0 - How many tweets per user retrieved

In [None]:
tweets_per_user = viral_tweets_df.groupby(by='author_id').size().reset_index(name='count')
tweets_per_user.sort_values(by='count')

In [None]:
tweets_per_user.hist(column='count', bins=10)
plt.title("Histogram of distribution of number of tweets retrieved per user")

#### 3.1 - Retweet count vs like count

In order to come up with a metric for the **virality** of a tweet, we need to know which features we will use to determine this metric. *retweet_count* and *like_count* will surely be among the features selected. Let's see how the two correlate.

**NOTE**: "The retweet will not show the likes and replies, only retweet count. You need to get the counts from the original tweet, which would be referenced in referenced_tweets and included in includes.tweets part of the response." - Twitter Community

In [None]:
# Keep only tweets with non-zero retweet and like counts, since retweets of other users' tweets report no likes (see note above)
retweeted = viral_tweets_df.retweet_count != 0
liked = viral_tweets_df.like_count != 0
original_tweets_df = viral_tweets_df[retweeted & liked]

# Remove NA in retweet and like count
original_tweets_df = original_tweets_df.dropna(axis=0, subset=['retweet_count', 'like_count'])

sns.scatterplot(data=original_tweets_df, x='retweet_count', y='like_count')

**Finding**: The relationship looks roughly linear, especially for lower counts.
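To put a number on the visual impression, a correlation coefficient can complement the scatter plot. The sketch below is illustrative only: the `toy` frame and its counts are invented, and Spearman's coefficient is computed as Pearson on ranks so that only pandas is needed.

```python
import pandas as pd

# Toy engagement counts, invented for illustration only.
toy = pd.DataFrame({
    "retweet_count": [1, 2, 5, 10, 50, 400],
    "like_count": [4, 9, 22, 41, 260, 1500],
})

# Pearson measures linear association between the raw counts.
pearson = toy["retweet_count"].corr(toy["like_count"])

# Spearman measures monotonic association: Pearson applied to the ranks.
spearman = toy["retweet_count"].rank().corr(toy["like_count"].rank())

print(round(pearson, 3), round(spearman, 3))
```

Because a few very large tweets can dominate Pearson's coefficient, comparing it with the rank-based value gives a rough sense of how much the outliers drive the result.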

#### 3.2 - (# Retweets / # followers ) ratio 


A viable metric for virality is the ratio of a tweet's retweet (or like) count to the author's follower count. The idea is that a user with few followers whose tweets garner many retweets or likes can reasonably be considered "viral". Conversely, a user with many followers routinely gets a high number of retweets, so those tweets cannot always be considered viral.

**Note**: Historical data on a user's follower count is not easily available and is not provided by the Twitter API. The ratios computed here therefore do not reflect the ratio at the time the tweet was posted, since the user may have gained many followers since then.
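As a minimal sketch of the ratio, assuming a frame that already holds `retweet_count` and `followers_count` side by side, the example below uses invented numbers and guards against users with zero followers, which would otherwise produce infinite ratios.

```python
import numpy as np
import pandas as pd

# Toy merged frame; all numbers are invented.
toy = pd.DataFrame({
    "retweet_count": [500, 20, 3000, 10],
    "followers_count": [100, 2000, 0, 10],
})

# Replace zero followers with NaN so the division yields NaN rather than inf.
followers = toy["followers_count"].replace(0, np.nan)
toy["retweets_followers_ratio"] = toy["retweet_count"] / followers

# A ratio above 1 means more retweets than the author has followers.
exceeds_audience = toy["retweets_followers_ratio"] > 1.0

print(toy.assign(exceeds_audience=exceeds_audience))
```

Rows with a NaN ratio compare as `False`, so they drop out of any `> 1.0` filter without special handling.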

In [None]:
viral_tweets_df_subset = original_tweets_df[['id', 'author_id', 'retweet_count', 'like_count']]

# Remove NA in follower count
users_df_subset = users_df.dropna(axis=0, subset=['followers_count'])

# Merge both on author id
# The user id is renamed to author_id so it does not clash with the tweet id column
tweets_users_merged_df = viral_tweets_df_subset.merge(
    right=users_df_subset[['id', 'followers_count']].rename(columns={'id': 'author_id'}), on='author_id')

In [None]:
tweets_users_merged_df['retweets_followers_ratio'] = tweets_users_merged_df['retweet_count'] / tweets_users_merged_df['followers_count']
tweets_users_merged_df.sort_values(by='retweets_followers_ratio')

In [None]:
import plotly.express as px

df_ratios_bigger_than_1 = tweets_users_merged_df[tweets_users_merged_df.retweets_followers_ratio > 1.0]
fig = px.histogram(
    df_ratios_bigger_than_1,
    x="retweets_followers_ratio",
    nbins=10,
    log_y=True)

fig.update_layout(
    title={
        'text': "Histogram of the distribution of the retweets/followers ratio > 1",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})


fig.show()

The histogram is not very clear, because tweets that became far more popular than their author are rare events. Those we can definitely consider viral. Maybe we can try K-means to better identify these outliers.
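Independently of clustering, one option for a heavy-tailed ratio is to bin it on a log scale, so that each bin spans one order of magnitude. This is only a sketch with invented values standing in for `retweets_followers_ratio`, not a result from the dataset.

```python
import numpy as np

# Invented heavy-tailed ratios (all > 1), standing in for retweets_followers_ratio.
ratios = np.array([1.2, 1.5, 2.0, 3.0, 8.0, 40.0, 900.0, 15000.0])

# Work in log10 space so that each bin covers one decade: [1, 10), [10, 100), ...
log_ratios = np.log10(ratios)
bin_edges = np.arange(0, np.ceil(log_ratios.max()) + 1)
counts, _ = np.histogram(log_ratios, bins=bin_edges)

for low, count in zip(bin_edges[:-1], counts):
    print(f"ratio in [1e{int(low)}, 1e{int(low) + 1}): {count} tweets")
```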

In [None]:
from sklearn.cluster import KMeans

n_clusters = 3
X = np.array(df_ratios_bigger_than_1[['retweet_count', 'followers_count']])
#X = np.vstack((df_ratios_bigger_than_1.retweet_count.to_numpy(), df_ratios_bigger_than_1.followers_count.to_numpy()))
#X = df_ratios_bigger_than_1.retweets_followers_ratio.to_numpy().reshape(-1, 1)
ratio_kmeans = KMeans(n_clusters=n_clusters, random_state=123).fit(X)

#np.vstack((X[:, 0], X[:, 1], ratio_kmeans.labels_)).reshape(-1, 3)
#px.scatter(ratio_kmeans, x=)
'''
plt.title(f'K-Means clustering of #retweets/#followers ratio with k={n_clusters}')
plt.xlabel('Retweets')
plt.ylabel('Followers')
plt.scatter(X[:, 0], X[:, 1], c=ratio_kmeans.labels_)
'''

In [None]:
kmeans_results_df = pd.DataFrame(X, columns=['retweet_count', 'follower_count']) 
kmeans_results_df['label'] = ratio_kmeans.labels_

In [None]:
px.scatter(kmeans_results_df, x='follower_count', y='retweet_count', color='label')


#### 3.3 - Metric (# Retweets  / avg #retweets of a user)

In [None]:
# avg_nb_retweets_per_user = viral_tweets_df_subset.groupby(by='author_id').agg({'retweet_count': ['min', 'mean', 'max'], 'like_count': ['min', 'mean', 'max']})
avg_nb_retweets_per_user = viral_tweets_df_subset.groupby(by='author_id').retweet_count.agg(['min', 'mean', 'max'])
avg_nb_retweets_per_user

In [None]:
ratio_retweet_avg_retweets_df = viral_tweets_df_subset.merge(avg_nb_retweets_per_user, on='author_id')
ratio_retweet_avg_retweets_df['per_user_performance'] = ratio_retweet_avg_retweets_df['retweet_count'] / ratio_retweet_avg_retweets_df['mean']
ratio_retweet_avg_retweets_df

In [None]:
bigger_than_mean = ratio_retweet_avg_retweets_df[ratio_retweet_avg_retweets_df.per_user_performance > 1]
hist = px.histogram(bigger_than_mean, x='per_user_performance', log_y=True)

hist.update_layout(title_text="Distribution of tweet performance wrt avg #retweets per user", xaxis_title="Tweet performance", yaxis_title="log count")

**Finding**: We established another metric by which we can judge the virality of a tweet, namely the number of retweets vs the average number of retweets per user. We can set a threshold (e.g. > 2) to decide whether a tweet is viral or not. We can also conduct further analysis over those tweets to determine what sets them apart from the others.
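The thresholding idea can be sketched as follows. The data is invented, and the cut-off of 2 is simply the example value mentioned in the finding, not a tuned parameter; `is_viral` is a hypothetical column name.

```python
import pandas as pd

# Toy tweets for two invented users.
toy = pd.DataFrame({
    "author_id": ["u1", "u1", "u1", "u2", "u2"],
    "retweet_count": [10, 10, 100, 5, 5],
})

# transform("mean") broadcasts each user's mean back onto that user's rows.
user_mean = toy.groupby("author_id")["retweet_count"].transform("mean")
toy["per_user_performance"] = toy["retweet_count"] / user_mean

THRESHOLD = 2  # assumed cut-off, taken from the example in the text
toy["is_viral"] = toy["per_user_performance"] > THRESHOLD

print(toy)
```

Using `transform` avoids the separate merge step, at the cost of not keeping the per-user `min` and `max` around.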

#### 3.4 - Tweet Topic (context annotations)

What topics are available? Context annotations are Twitter's version of analyzing the topic of a tweet. They are defined as a context **domain** and **entity**. The domain is like a general topic and entity is like a subtopic or a specific topic within the general domain.
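To make the domain/entity structure concrete, here is a small sketch that parses one annotation list and counts domains. The IDs and names are invented placeholders, and the input is assumed to be a Python-literal string, which is how such lists look after a CSV round trip.

```python
import ast
from collections import Counter

# A context_annotations value as it might look after being written to CSV.
# The IDs and names below are placeholders, not real Twitter taxonomy entries.
raw = (
    "[{'domain': {'id': 'd1', 'name': 'Example Domain'}, "
    "'entity': {'id': 'e1', 'name': 'Example Entity A'}}, "
    "{'domain': {'id': 'd1', 'name': 'Example Domain'}, "
    "'entity': {'id': 'e2', 'name': 'Example Entity B'}}]"
)

# literal_eval parses Python literals only, so it cannot run arbitrary code.
annotations = ast.literal_eval(raw)

# Each annotation pairs a general domain with a more specific entity.
domain_counts = Counter(a["domain"]["name"] for a in annotations)
entity_names = [a["entity"]["name"] for a in annotations]

print(domain_counts, entity_names)
```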

In [None]:
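import ast
import json

# Pass subset as a list for compatibility with older pandas versions
tweets_with_topics = original_tweets_df.dropna(axis=0, subset=['context_annotations'])

def topic_to_json(x):
    # The CSV stores Python-literal strings (single quotes). Swapping quotes and
    # calling json.loads breaks on apostrophes inside names, so parse the literal directly.
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        print("Nope")
        return []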

TODO tomorrow:
- Try sample and make it work with context annotations.
- Check if has media is not null
- hashtags extract tags
- Extract context annotations
- Use Celia Bearer Token

In [None]:
from tweepy import Paginator, TooManyRequests
client = twitter_client_wrapper.client
#tweet_data = twitter_client_wrapper.client.get_users_tweets(id='1584975692126900225', expansions=EXPANSIONS, user_fields=USER_FIELDS, tweet_fields=TWEET_FIELDS, media_fields=MEDIA_FIELDS, exclude='retweets')

viral_users_tweets = []
# Collect up to 20 tweets of one sample user, excluding retweets
try:
    for tweet in Paginator(client.get_users_tweets, id='1482846121517096961', tweet_fields=TWEET_FIELDS, exclude="retweets").flatten(limit=20):
        viral_users_tweets.append(tweet.data)
except TooManyRequests:
    print("Hit Rate Limit")


In [None]:
domains = {}
entities = {}
for tweet in viral_users_tweets:
    context_annotations = tweet.get('context_annotations', [])
    tweet_topic_domains = dict([(topic['domain']['id'], topic['domain']) for topic in context_annotations])
    domains.update(tweet_topic_domains)
    tweet_topic_entities = dict([(topic['entity']['id'], topic['entity']) for topic in context_annotations])
    entities.update(tweet_topic_entities)
    tweet['topic_domain'] = list(tweet_topic_domains.keys())
    tweet['topic_entity'] = list(tweet_topic_entities.keys())
    tweet.pop('context_annotations', None)

In [None]:
import pickle

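# Persist the domain lookup (id -> domain): the file name and the later lookups by domain id expect domains, not entities
with open('topic_domains.pickle', 'wb') as handle:
    pickle.dump(domains, handle, protocol=pickle.HIGHEST_PROTOCOL)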

with open('topic_domains.pickle', 'rb') as handle:
    b = pickle.load(handle)

b

In [None]:
try:
    with open('topic_domains.pickle', 'rb') as handle:
        topic_domains = pickle.load(handle)
except FileNotFoundError:
    topic_domains = {}

topic_domains

In [None]:
temp = pd.json_normalize(viral_users_tweets)
#temp[temp.context_annotations.notna()]
temp

In [None]:
domains

In [None]:
s = pd.Series([b[item]['name'] for items in temp.topic_domain.values for item in items])
s.groupby(s).count().sort_values()

In [None]:
viral_users_tweets_2 = []
# Collect up to 100 tweets of a second sample user, excluding retweets
try:
    for tweet in Paginator(client.get_users_tweets, id='848263392943058944', tweet_fields=TWEET_FIELDS, exclude="retweets").flatten(limit=100):
        viral_users_tweets_2.append(tweet.data)
except TooManyRequests:
    print("Hit Rate Limit")

In [None]:
domains = {}
entities = {}
for tweet in viral_users_tweets_2:
    context_annotations = tweet.get('context_annotations', [])
    tweet_topic_domains = dict([(topic['domain']['id'], topic['domain']) for topic in context_annotations])
    domains.update(tweet_topic_domains)
    tweet_topic_entities = dict([(topic['entity']['id'], topic['entity']) for topic in context_annotations])
    entities.update(tweet_topic_entities)
    tweet['topic_domain'] = list(tweet_topic_domains.keys()) if len(tweet_topic_domains.keys()) > 0 else pd.NA
    tweet['topic_entity'] = list(tweet_topic_entities.keys()) if len(tweet_topic_entities.keys()) > 0 else pd.NA
    #tweet.pop('context_annotations', None)

In [None]:
temp2_df = pd.json_normalize(viral_users_tweets_2)
first_context = temp2_df[~temp2_df.topic_domain.isna()].topic_domain.iloc[2]

In [None]:
temp2_df[~temp2_df['entities.hashtags'].isna()]

In [None]:
temp2_df.to_csv("temp.csv", index=False)

In [None]:
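import ast

# ast.literal_eval only parses Python literals, so unlike eval it cannot execute arbitrary code read from the CSV
temp2_read = pd.read_csv('temp.csv', converters={'context_annotations': lambda x: ast.literal_eval(x) if (x and len(x) > 0) else np.nan})
first_context = temp2_read[~temp2_read.context_annotations.isna()].context_annotations.iloc[2]
first_context

In [None]:
# first_context was already parsed by the converter above, so inspect its first annotation directly
first_context[0]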

In [None]:
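def format_context_annotations(context_annotations):
    # Values coming straight from the API are already lists; pd.isna on a list returns an array, so check the type first
    if isinstance(context_annotations, list):
        return context_annotations
    if pd.isna(context_annotations):
        return []
    return json.loads(context_annotations)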

temp2_df.context_annotations.apply(format_context_annotations)

In [None]:
pd.DataFrame(viral_users_tweets_2, columns=TWEET_FIELDS).to_csv('temp_2.csv', index=False)

In [None]:
#tweet_data = twitter_client_wrapper.client.get_tweet(id='1584975692126900225', expansions=EXPANSIONS, user_fields=USER_FIELDS, tweet_fields=TWEET_FIELDS, media_fields=MEDIA_FIELDS)
bytes(tweets_with_topics.iloc[1000].context_annotations, encoding='utf-8').decode('unicode_escape')

In [6]:
import ast

dtypes={"id": str, "author_id": str, "has_media": bool, "possibly_sensitive": bool, "has_hashtags": bool}
temp3 = pd.read_csv("145371604-to-146944733.csv", dtype=dtypes)
d = temp3[~temp3.topic_domains.isna()].topic_domains.iloc[0]
# Parse the stored list literal without executing arbitrary code
ast.literal_eval(d)[0]

'46'

#### 3.5 - Tweet Sentiment

#### 3.6 - Possibly sensitive

#### 3.7 - Hashtags

In [None]:
# TODO: has hashtags (using entities.hashtags)

#### 3.8 - Text preprocessing

TODO:
- Sort by tweet date (check popularity)
- Use Twitter lists to try and find
- Check if reply or retweet