| import streamlit as st | |
| from PIL import Image | |
| st.title("Latent Dirichlet Allocation (LDA)") | |
| st.header("Research & Methodology", divider = "blue") | |
| st.markdown(""" | |
| ### Purpose | |
| We used LDA as a baseline model to detect and analyze climate anxiety among youth on social media platforms such as Reddit and Twitter (X). | |
| As a well-established probabilistic topic modeling method, LDA is well suited to identifying high-level thematic structures | |
| in large text corpora. Its performance serves as a benchmark for the more advanced BERTopic model, | |
| allowing us to evaluate the added value of BERTopic’s contextual embeddings and clustering capabilities compared | |
| to LDA’s traditional bag-of-words approach. | |
| --- | |
| ### Process Flow | |
| The analysis pipeline involved collecting and cleaning climate-related text data from Reddit and Twitter, followed by a preprocessing | |
| phase that included normalization, lemmatization, and customized filtering to preserve topic-relevant terms. The cleaned data was | |
| then transformed into a document-term matrix using bag-of-words techniques with frequency-based term filtering. Topic modeling was | |
| performed using LDA, optimized through hyperparameter tuning, and evaluated via coherence and perplexity | |
| metrics. Finally, insights were drawn by comparing topic coherence and granularity to better understand how climate anxiety is expressed | |
| in different online communities. | |
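
As an illustration of the document-term matrix step, here is a minimal sketch in plain Python. It is not the project's actual code: the function name `build_document_term_matrix` and the thresholds `min_df` and `max_df_ratio` are hypothetical, chosen only to show how frequency-based filtering can drop terms that are too rare or too common before topic modeling.

```python
from collections import Counter


def build_document_term_matrix(docs, min_df=2, max_df_ratio=0.5):
    # docs: list of token lists, assumed to be already cleaned and lemmatized.
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Keep terms that are neither too rare nor too common (hypothetical thresholds).
    vocab = sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    )
    index = {term: i for i, term in enumerate(vocab)}
    # One row per document, one column per retained term, cells hold raw counts.
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix


# Toy corpus for illustration only.
docs = [
    ["climate", "anxiety", "future", "climate"],
    ["climate", "policy", "future"],
    ["wildfire", "anxiety", "climate"],
    ["policy", "solar", "energy"],
]
vocab, dtm = build_document_term_matrix(docs, min_df=2, max_df_ratio=0.5)
```

In this toy corpus, `climate` appears in three of four documents and is filtered out as too common, while `wildfire`, `solar`, and `energy` each appear in a single document and are filtered out as too rare.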
| """) | |
| pdf_path = "documents/LDA Documentation.pdf" | |
| with open(pdf_path, "rb") as file: | |
|     st.download_button("Download documentation", file, file_name="LDA Documentation.pdf") | |
| st.header("Results", divider="green") | |
| st.markdown(""" | |
| ### Coherence and Perplexity Scores Analysis | |
| Coherence scores measure the semantic similarity between the top words within each topic. Scores here range from 0 to 1, and a higher score suggests that the topics are more meaningful and interpretable. | |
| - Reddit model: Coherence score of 0.38 | |
| → Moderate interpretability, likely due to overlapping topics or less distinct themes. | |
| - Twitter model: Coherence score of 0.43 | |
| → Topics are more consistent, with clearer distinctions between them and easier interpretability. | |
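
To make the idea of coherence concrete, the sketch below scores a topic's top words by how often they co-occur in the same documents, using normalized pointwise mutual information (NPMI). This is an assumption for illustration: NPMI ranges from -1 to 1, whereas the scores reported here lie on a 0 to 1 scale, so the reported values presumably come from a different coherence measure. The function name `npmi_coherence` is hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, docs):
    # docs: list of token lists; each document is treated as a set of words.
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)  # convention chosen here to avoid 0/0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    # Topic coherence is the mean score over all word pairs.
    return sum(scores) / len(scores)


# Toy corpus for illustration only.
docs = [
    ["heat", "wave"],
    ["heat", "wave"],
    ["solar", "panel"],
    ["solar", "panel"],
]
coherent = npmi_coherence(["heat", "wave"], docs)
incoherent = npmi_coherence(["heat", "solar"], docs)
mixed = npmi_coherence(["heat", "wave", "solar"], docs)
```

Words that always appear together score close to 1, words that never appear together score -1, and a topic mixing both kinds of pairs lands in between.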
| Perplexity measures how well the model predicts unseen words in the corpus. Lower perplexity generally indicates better generalization of topics. | |
| - Reddit model: Perplexity score of 2874.72 | |
| → The model struggles to generalize and is most likely overfitting. | |
| - Twitter model: Perplexity score of 1865.36 | |
| → Generalizes and captures patterns in the data better than the Reddit model, but still appears to overfit. | |
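
For intuition, perplexity can be read as the exponential of the average negative log-probability the model assigns to each held-out word. The sketch below computes it directly from given topic distributions. It is a simplified illustration under stated assumptions, not how the scores above were produced: topic modeling libraries typically estimate this quantity through an approximate bound, and the names `perplexity`, `doc_topic`, and `topic_word` are hypothetical.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    # docs: list of documents, each a list of word ids.
    # doc_topic[d][k]: probability of topic k in document d.
    # topic_word[k][w]: probability of word w under topic k.
    total_log_prob = 0.0
    total_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Word probability is a mixture over the document's topics.
            p_w = sum(
                theta_k * topic_word[k][w]
                for k, theta_k in enumerate(doc_topic[d])
            )
            total_log_prob += math.log(p_w)
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)


# A model that spreads probability evenly over a 4-word vocabulary.
uniform_ppl = perplexity(
    docs=[[0, 1, 2], [3, 3]],
    doc_topic=[[0.5, 0.5], [0.9, 0.1]],
    topic_word=[[0.25] * 4, [0.25] * 4],
)

# A model whose topics concentrate on the words that actually occur.
peaked_ppl = perplexity(
    docs=[[0, 0, 0], [3, 3]],
    doc_topic=[[1.0, 0.0], [0.0, 1.0]],
    topic_word=[[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
)
```

A model that is uniform over a vocabulary of 4 words has a perplexity of 4, as if it were guessing among 4 equally likely words at every position. The second model assigns more probability to the observed words and so scores lower.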
| --- | |
| ### Why the Twitter Model Performs Better | |
| Several factors may contribute to the superior performance of the Twitter model: | |
| 1. Twitter's character limit leads to more concise and direct language. This increases linguistic similarity within topics, | |
| making them easier to cluster and interpret. | |
| 2. Reddit posts and comments are longer and often multifaceted, blending several ideas into one post. This makes topic separation | |
| more difficult and increases ambiguity. | |
| Overall, the Twitter model outperforms the Reddit model, likely because the focused and streamlined nature of tweets allows for clearer topic | |
| boundaries and better coherence and perplexity scores. Reddit's complex and multifaceted discourse poses challenges for | |
| traditional topic modeling, resulting in more blended and less distinct topics. | |
| """) | |
| st.header("Insights into Climate Focus", divider="red") | |
| reddit_vis = Image.open("visualizations/lda_reddit_twd.png") | |
| twitter_vis = Image.open("visualizations/lda_twitter_twd.png") | |
| col1, col2 = st.columns(2) | |
| with col1: | |
|     st.image(reddit_vis, caption="Reddit Topic-Word Distribution") | |
| with col2: | |
|     st.image(twitter_vis, caption="Twitter Topic-Word Distribution") | |
| st.markdown(""" By analyzing key terms and themes within each platform's topic-word distribution, we | |
| can better understand the unique aspects each platform emphasizes in the broader climate discourse. | |
| --- | |
| ### Reddit's Focus | |
| Reddit’s discussions are generally analytical, scientific, and policy-driven, exhibiting the following trends: | |
| - Frequent references to climate models, temperature trends, and global warming projections. | |
| - Strong emphasis on government policies, sustainability initiatives, and legislative debates regarding climate regulations. | |
| - Active discussion around solar, wind, and other alternative energy innovations. | |
| - Coverage of grassroots environmental movements, activism tactics, and advocacy campaigns. | |
| - While the discussion is largely pro-science, some threads engage with climate skepticism, often to debunk misinformation. | |
| --- | |
| ### Twitter's Focus | |
| Twitter (X) tends to be more event-driven, highlighting immediate climate developments and their social impacts: | |
| - High volume of posts about hurricanes, floods, wildfires, and other disasters linked to climate change. | |
| - Trending hashtags, viral campaigns, and calls to action dominate the climate narrative. | |
| - Tweets often focus on statements and actions by politicians, corporations, and influencers. | |
| - Strong emphasis on how climate change disproportionately affects marginalized communities. | |
| - Real-time responses to events take precedence over long-term scientific forecasts. | |
| Reddit and Twitter offer complementary perspectives on climate discourse. Reddit tends to focus on research-oriented, | |
| long-form content, emphasizing data-driven science, legislative solutions, and long-term climate projections. | |
| In contrast, Twitter (X) centers on real-time events and social reactions, often highlighting individual actions, | |
| corporate accountability, and justice-driven activism. While Reddit fosters in-depth, technical discussions, | |
| Twitter amplifies public awareness through dynamic, media-rich narratives. Together, these platforms provide | |
| a multifaceted view of how climate anxiety and engagement manifest across different online communities. | |
| """) |