# Streamlit page presenting the LDA baseline: methodology, results, and insights.
import streamlit as st
from PIL import Image

st.title("Latent Dirichlet Allocation (LDA)")

st.header("Research & Methodology", divider="blue")
st.markdown("""
### Purpose
We used LDA as a baseline model to detect and analyze climate anxiety among youth on social media platforms like Reddit and Twitter (X). Being a well-established probabilistic topic modeling method, it's well-suited for identifying high-level thematic structures in large text corpora.

The performance of our LDA model will serve as a benchmark for the more advanced BERTopic model. This will allow us to evaluate the added value of BERTopic’s contextual embeddings and clustering capabilities compared to LDA’s traditional bag-of-words approach.

---

### Process Flow
The analysis pipeline involved collecting and cleaning climate-related text data from Reddit and Twitter, followed by a preprocessing phase that included normalization, lemmatization, and customized filtering to preserve topic-relevant terms. The cleaned data was then transformed into a document-term matrix using bag-of-words techniques with frequency-based term filtering. Topic modeling was performed using LDA, optimized through hyperparameter tuning, and evaluated via coherence and perplexity metrics. Finally, insights were drawn by comparing topic coherence and granularity to better understand how climate anxiety is expressed in different online communities.
""")

# Offer the full write-up as a downloadable PDF.
pdf_path = "documents/LDA Documentation.pdf"
with open(pdf_path, "rb") as file:
    st.download_button("Download documentation", file, file_name="LDA Documentation.pdf")

st.header("Results", divider="green")
st.markdown("""
### Coherence and Perplexity Scores Analysis
Coherence scores measure the semantic similarity between words within each topic. A higher coherence score (ranging from 0 to 1) suggests that the topics are more meaningful and interpretable.
- Reddit model: Coherence score of 0.38 → Moderate interpretability, likely due to overlap between topics or less distinct themes.
- Twitter model: Coherence score of 0.43 → Topics are more consistent, with clearer distinctions between them and easier interpretability.

Perplexity measures how well the model predicts unseen words in the corpus. Lower perplexity generally indicates better generalization of topics.

- Reddit model: Perplexity score of 2874.72 → The model struggles to generalize and is most likely overfitting.
- Twitter model: Perplexity score of 1865.36 → Generalizes and captures patterns in the data better than the Reddit model, but still overfits.

---

### Why the Twitter Model Performs Better
Several factors may contribute to the superior performance of the Twitter model:

1. Twitter's character limit leads to more concise and direct language. This increases linguistic similarity within topics, making them easier to cluster and interpret.
2. Reddit posts and comments are longer and often multifaceted, blending several ideas into one post. This makes topic separation more difficult and increases ambiguity.

Overall, the Twitter model outperforms the Reddit model due to the focused and streamlined nature of tweets, which allows for clearer topic boundaries and improved coherence and perplexity scores. Reddit's complex and multifaceted discourse poses challenges for traditional topic modeling, resulting in more blended and less distinct topics.
""")

st.header("Insights into Climate Focus", divider="red")

# Show the topic-word distribution plots for both platforms side by side.
reddit_vis = Image.open("visualizations/lda_reddit_twd.png")
twitter_vis = Image.open("visualizations/lda_twitter_twd.png")

col1, col2 = st.columns(2)
with col1:
    st.image(reddit_vis, caption="Reddit Topic-Word Distribution")
with col2:
    st.image(twitter_vis, caption="Twitter Topic-Word Distribution")

st.markdown("""
By analyzing key terms and themes within each platform's topic-word distribution, we can better understand the unique aspects each platform emphasizes in the broader climate discourse.

---

### Reddit's Focus
Reddit’s discussions are generally analytical, scientific, and policy-driven, exhibiting the following trends:

- Frequent references to climate models, temperature trends, and global warming projections.
- Strong emphasis on government policies, sustainability initiatives, and legislative debates regarding climate regulations.
- Active discussion around solar, wind, and other alternative energy innovations.
- Coverage of grassroots environmental movements, activism tactics, and advocacy campaigns.
- While largely pro-science, some threads engage with climate skepticism, often aiming to debunk misinformation.

---

### Twitter's Focus
Twitter (or X) tends to be more event-driven, highlighting immediate climate developments and their social impacts:

- High volume of posts about hurricanes, floods, wildfires, and other disasters linked to climate change.
- Trending hashtags, viral campaigns, and calls to action dominate the climate narrative.
- Tweets often focus on statements and actions by politicians, corporations, and influencers.
- Strong emphasis on how climate change disproportionately affects marginalized communities.
- Real-time responses to events take precedence over long-term scientific forecasts.

Reddit and Twitter offer complementary perspectives on climate discourse.
Reddit tends to focus on research-oriented, long-form content, emphasizing data-driven science, legislative solutions, and long-term climate projections. In contrast, Twitter (X) centers on real-time events and social reactions, often highlighting individual actions, corporate accountability, and justice-driven activism. While Reddit fosters in-depth, technical discussions, Twitter amplifies public awareness through dynamic, media-rich narratives. Together, these platforms provide a multifaceted view of how climate anxiety and engagement manifest across different online communities.
""")