import streamlit as st
from PIL import Image

st.title("Latent Dirichlet Allocation (LDA)")

st.header("Research & Methodology", divider="blue")

st.markdown("""
### Purpose
            
We used LDA as a baseline model to detect and analyze climate anxiety among youth on social media platforms such as Reddit and Twitter (X).
            As a well-established probabilistic topic modeling method, it is well suited to identifying high-level thematic structures
            in large text corpora. The performance of our LDA model serves as a benchmark for the more advanced BERTopic model,
            which lets us evaluate the added value of BERTopic’s contextual embeddings and clustering capabilities compared
            with LDA’s traditional bag-of-words approach.

---

### Process Flow
            
The analysis pipeline involved collecting and cleaning climate-related text data from Reddit and Twitter, followed by a preprocessing 
            phase that included normalization, lemmatization, and customized filtering to preserve topic-relevant terms. The cleaned data was 
            then transformed into a document-term matrix using bag-of-words techniques with frequency-based term filtering. Topic modeling was 
            performed using LDA, optimized through hyperparameter tuning, and evaluated via coherence and perplexity 
            metrics. Finally, insights were drawn by comparing topic coherence and granularity to better understand how climate anxiety is expressed 
            in different online communities.

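The document-term matrix step described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function name `build_doc_term_matrix` and the thresholds `min_df` and `max_df_ratio` are assumptions chosen for the example.

```python
from collections import Counter

def build_doc_term_matrix(docs, min_df=2, max_df_ratio=0.9):
    # docs: list of token lists that are already cleaned and lemmatized.
    # min_df and max_df_ratio are illustrative thresholds, not the project's values.
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Keep terms that are neither too rare nor too common.
    vocab = sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    )
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix

# Toy corpus, invented for illustration only.
docs = [
    ["climate", "anxiety", "future"],
    ["climate", "policy", "future"],
    ["climate", "anxiety", "anxiety", "wildfire"],
    ["policy", "vote"],
]
vocab, matrix = build_doc_term_matrix(docs, min_df=2, max_df_ratio=0.7)
print(vocab)
print(matrix)
```

In this toy corpus, "climate" is dropped for appearing in too many documents, and "wildfire" and "vote" are dropped for appearing in only one. Each remaining row holds the term counts that a bag-of-words model such as LDA takes as input.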
""")

pdf_path = "documents/LDA Documentation.pdf"
with open(pdf_path, "rb") as file:
    st.download_button("Download documentation", file, file_name="LDA Documentation.pdf")



st.header("Results", divider="green")

st.markdown("""
### Coherence and Perplexity Scores Analysis

Coherence scores measure the semantic similarity among the top words within each topic. A higher coherence score (here on a scale from 0 to 1) suggests that the topics are more meaningful and interpretable.

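To make the idea of coherence concrete, here is a small sketch that scores a topic by how often its top words appear in the same documents. It uses normalized pointwise mutual information (NPMI), which is an assumption for illustration: this page does not state which coherence measure produced the scores below, and NPMI ranges from -1 to 1, so its values are not directly comparable with them. The function name `npmi_coherence` is hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    # top_words: the highest-weighted words of one topic (at least two).
    # docs: list of token lists; co-occurrence is counted per document.
    if len(top_words) < 2:
        raise ValueError("need at least two words to score a topic")
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = sum(w1 in d for d in doc_sets) / n_docs
        p2 = sum(w2 in d for d in doc_sets) / n_docs
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n_docs
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)   # the pair appears together in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    # Average over all word pairs of the topic.
    return sum(scores) / len(scores)

# Toy corpus, invented for illustration only.
docs = [
    ["climate", "anxiety"],
    ["climate", "anxiety"],
    ["policy", "vote"],
    ["policy", "vote"],
]
print(npmi_coherence(["climate", "anxiety"], docs))  # words always co-occur: close to 1
print(npmi_coherence(["climate", "vote"], docs))     # words never co-occur: -1.0
```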
- Reddit model: Coherence score of 0.38  
  → Moderate interpretability, likely because of overlap between topics or less distinct themes.

- Twitter model: Coherence score of 0.43  
  → Topics are more consistent, with clearer distinctions between them and easier interpretability.

Perplexity measures how well the model predicts unseen words in the corpus. Lower perplexity generally indicates better generalization of topics.

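As a rough sketch of what a perplexity value means: it is the exponential of the average negative log-likelihood per held-out token. The helper `perplexity` below is hypothetical and assumes natural-log probabilities; it is not the evaluation code behind the scores on this page. Read this way, a perplexity of about 2,875 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 2,875 words.

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probability the model assigns to each held-out token.
    n_tokens = len(token_log_probs)
    avg_neg_log_likelihood = -sum(token_log_probs) / n_tokens
    return math.exp(avg_neg_log_likelihood)

# A model that guesses uniformly over a 1,000-word vocabulary (toy example).
uniform = [math.log(1 / 1000)] * 50
print(perplexity(uniform))  # close to 1000

# A model that assigns probability 0.5 to every token it sees.
confident = [math.log(0.5)] * 50
print(perplexity(confident))  # close to 2
```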
- Reddit model: Perplexity score of 2874.72  
  → The model struggles to generalize and is most likely overfitting.

- Twitter model: Perplexity score of 1865.36  
  → Generalizes and captures patterns in the data better than the Reddit model, but still overfits.
---

### Why the Twitter Model Performs Better

Several factors may contribute to the superior performance of the Twitter model:

1. Twitter's character limit leads to more concise and direct language. This increases linguistic similarity within topics, 
            making them easier to cluster and interpret.

2. Reddit posts and comments are longer and often multifaceted, blending several ideas into one post. This makes topic separation 
            more difficult and increases ambiguity.


Overall, the Twitter model outperforms the Reddit model, likely because of the focused and streamlined nature of tweets, which allows for clearer topic
            boundaries and better coherence and perplexity scores. Reddit's complex and multifaceted discourse poses challenges for
            traditional topic modeling, resulting in more blended and less distinct topics.
""")



st.header("Insights into Climate Focus", divider="red")

reddit_vis = Image.open("visualizations/lda_reddit_twd.png")


twitter_vis = Image.open("visualizations/lda_twitter_twd.png")

col1, col2 = st.columns(2)
with col1:
    st.image(reddit_vis, caption="Reddit Topic-Word Distribution")

with col2:
    st.image(twitter_vis, caption="Twitter Topic-Word Distribution")

st.markdown("""By analyzing key terms and themes within each platform's topic-word distribution, we
            can better understand the unique aspects each platform emphasizes in the broader climate discourse.

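Reading a topic-word distribution usually comes down to listing the highest-weighted terms per topic. The sketch below shows that step under stated assumptions: the helper `top_terms` is hypothetical, and the weights are made-up illustrative numbers, not values from the Reddit or Twitter models.

```python
def top_terms(topic_word, n=3):
    # topic_word: mapping of topic label to a mapping of word to weight.
    # Words are ranked by descending weight, with ties broken alphabetically.
    return {
        topic: [
            word
            for word, _ in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        ]
        for topic, weights in topic_word.items()
    }

# Illustrative weights only; not taken from the models shown above.
example = {
    "topic_0": {"climate": 0.30, "policy": 0.25, "energy": 0.20, "vote": 0.05},
    "topic_1": {"wildfire": 0.40, "flood": 0.35, "storm": 0.10, "heat": 0.10},
}
print(top_terms(example, n=2))
```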
---

### Reddit's Focus

Reddit’s discussions are generally analytical, scientific, and policy-driven, exhibiting the following trends:

- Frequent references to climate models, temperature trends, and global warming projections.
- Strong emphasis on government policies, sustainability initiatives, and legislative debates regarding climate regulations.
- Active discussion around solar, wind, and other alternative energy innovations.
- Coverage of grassroots environmental movements, activism tactics, and advocacy campaigns.
- While largely pro-science, some threads engage with climate skepticism, often aiming to debunk misinformation.

---

### Twitter's Focus

Twitter (X) tends to be more event-driven, highlighting immediate climate developments and their social impacts:

- High volume of posts about hurricanes, floods, wildfires, and other disasters linked to climate change.
- Trending hashtags, viral campaigns, and calls to action dominate the climate narrative.
- Tweets often focus on statements and actions by politicians, corporations, and influencers.
- Strong emphasis on how climate change disproportionately affects marginalized communities.
- Real-time responses to events take precedence over long-term scientific forecasts.

Reddit and Twitter offer complementary perspectives on climate discourse. Reddit tends to focus on research-oriented, 
            long-form content, emphasizing data-driven science, legislative solutions, and long-term climate projections. 
            In contrast, Twitter (X) centers on real-time events and social reactions, often highlighting individual actions, 
            corporate accountability, and justice-driven activism. While Reddit fosters in-depth, technical discussions, 
            Twitter amplifies public awareness through dynamic, media-rich narratives. Together, these platforms provide 
            a multifaceted view of how climate anxiety and engagement manifest across different online communities.
""")