Files changed (21)
  1. .DS_Store +0 -0
  2. .env +0 -1
  3. .gitignore +0 -10
  4. .gitignore.save +0 -3
  5. README.md +147 -19
  6. app.py +399 -255
  7. debug_regex.py +0 -23
  8. finetune.py +0 -102
  9. helper.py +91 -195
  10. naive_bayes_model.pkl +0 -3
  11. openrouter_chat.py +0 -91
  12. preprocessor.py +187 -165
  13. profile_performance.py +0 -70
  14. reproduce_issue.py +0 -27
  15. requirements.txt +1 -2
  16. sentiment.py +68 -58
  17. sentiment_train.py +0 -41
  18. test.py +0 -67
  19. tfidf_vectorizer.pkl +0 -3
  20. verify_fix.py +0 -48
  21. verify_refactor.py +0 -41
.DS_Store DELETED
Binary file (6.15 kB)
 
.env DELETED
@@ -1 +0,0 @@
- OPENROUTER_API_KEY="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"
 
 
.gitignore DELETED
@@ -1,10 +0,0 @@
- venv/
- .env/
- __pycache__/
- *.pyc
- .streamlit/secrets.toml
- .streamlit/secrets.toml
- .streamlit/secrets.toml
- .venv/
- venv/
- .venv/
 
.gitignore.save DELETED
@@ -1,3 +0,0 @@
- .venv/
-
-
 
README.md CHANGED
@@ -1,25 +1,153 @@
- ---
- title: WhatsApp Chat
- emoji: 😻
- colorFrom: pink
- colorTo: gray
- sdk: streamlit
- sdk_version: 1.44.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: πŸ“Š WhatsApp Chat Sentiment Analysis
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
- ## Running Locally
-
- 1. Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```

- 2. Run the application:
  ```bash
- streamlit run app.py
  ```
 
+ # WhatsApp Chat Analyzer
+
+ A comprehensive tool for analyzing WhatsApp chat exports with sentiment analysis capabilities.
+
+ ## Table of Contents
+ 1. [System Overview](#system-overview)
+ 2. [Architecture](#architecture)
+ 3. [Components](#components)
+ 4. [Data Flow](#data-flow)
+ 5. [Installation](#installation)
+ 6. [Usage](#usage)
+ 7. [Analysis Capabilities](#analysis-capabilities)
+
+ ## System Overview
+
+ The WhatsApp Chat Analyzer is a Python-based application that processes exported WhatsApp chat data to provide:
+ - Message statistics and metrics
+ - Temporal activity patterns
+ - User engagement analysis
+ - Content analysis (words, emojis, links)
+ - Sentiment analysis capabilities
+ - Topic analysis for group chats
+
+ Built with Streamlit for the web interface, it offers an interactive way to explore chat dynamics and analyze sentiment.
+
+ ## Architecture
+
+ The system follows a modular architecture with clear separation of concerns:
+
+ ```
+ Raw WhatsApp Chat β†’ Preprocessing β†’ Analysis β†’ Visualization
+ ```
+
+ Key architectural decisions:
+ - **Modular Design**: Components are separated by functionality
+ - **Pipeline Processing**: Data flows through discrete processing stages
+ - **Interactive UI**: Streamlit enables real-time exploration
+
+ ## Components
+
+ ### 1. App Module (`app.py`)
+ - **Responsibility**: User interface and visualization
+ - **Key Features**:
+   - File upload handling
+   - User selection interface
+   - Visualization rendering
+   - Interactive controls
+
+ ### 2. Preprocessor (`preprocessor.py`)
+ - **Responsibility**: Data cleaning and structuring
+ - **Key Features**:
+   - Handles multiple date/time formats
+   - Extracts messages and metadata
+   - Filters system messages
+   - Creates structured DataFrame
+
+ ### 3. Helper Module (`helper.py`)
+ - **Responsibility**: Analytical computations
+ - **Key Features**:
+   - Statistical metrics
+   - Temporal analysis
+   - Content analysis
+   - Visualization data preparation
+
+ ### 4. Notebook (`whatsAppAnalyzer.ipynb`)
+ - **Responsibility**: Prototyping and experimentation
+ - **Key Features**:
+   - Initial pattern development
+   - Data exploration
+   - Algorithm testing
+
+ ## Data Flow
+
+ 1. **Input**: User uploads WhatsApp chat export (.txt)
+ 2. **Preprocessing**:
+    - Raw text is parsed using regex patterns
+    - Messages are categorized and timestamped
+    - Structured DataFrame is created
+ 3. **Analysis**:
+    - Selected metrics are computed
+    - Temporal patterns are identified
+    - Content features are extracted
+ 4. **Visualization**:
+    - Results are displayed in interactive charts
+    - User can explore different views
+
+ ## Installation
+
+ ### Prerequisites
+ - Python 3.8+
+ - pip package manager
+
+ ### Steps
+ 1. Clone the repository:
+ ```bash
+ git clone [repository-url]
+ cd whatsapp-analyzer
+ ```
+
+ 2. Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```

+ 3. Run the application:
  ```bash
+ streamlit run srcs/app.py
  ```
+
+ ## Usage
+
+ 1. Launch the application
+ 2. Upload a WhatsApp chat export file
+ 3. Select a user or "Overall" for group analysis
+ 4. Explore the various analysis tabs:
+    - Statistics
+    - Timelines
+    - Activity Maps
+    - Word Clouds
+    - Emoji Analysis
+
+ ## Analysis Capabilities
+
+ ### 1. Basic Statistics
+ - Message counts
+ - Word counts
+ - Media shared
+ - Links shared
+
+ ### 2. Temporal Analysis
+ - Daily activity patterns
+ - Monthly trends
+ - Hourly distributions
+
+ ### 3. User Engagement
+ - Most active users
+ - User participation rates
+ - Message distribution
+
+ ### 4. Content Analysis
+ - Most common words
+ - Emoji usage
+
+ ### 5. Sentiment Analysis
+ - Message sentiment scoring
+ - Sentiment trends over time
+ - User sentiment comparison
+
+ ### 6. Topics Analysis
+ - Topic modeling
+ - Common topics over time
+ - User interests
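
The Data Flow section above says the raw export is parsed with regex patterns into a structured table. As a rough illustration of that step, here is a minimal sketch with a hypothetical `parse_chat` helper, assuming only one common export format ("DD/MM/YY, HH:MM - Sender: Message"); the project's `preprocessor.py` handles several date/time variants and builds a pandas DataFrame instead of a plain list.

```python
import re

# One common WhatsApp export line format; multiline messages and
# system notifications would need extra handling.
PATTERN = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$")

def parse_chat(raw_text):
    """Parse matching lines into one dict per message."""
    rows = []
    for line in raw_text.splitlines():
        m = PATTERN.match(line)
        if m:
            date, time, user, message = m.groups()
            rows.append({"date": date, "time": time, "user": user, "message": message})
    return rows

rows = parse_chat("12/03/23, 14:05 - Alice: hello\n12/03/23, 14:06 - Bob: hi there")
# rows[0]["user"] == "Alice"
```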
app.py CHANGED
@@ -1,16 +1,15 @@
  import streamlit as st
  import pandas as pd
  import matplotlib.pyplot as plt
- import os
-
- # Silence tokenizers warning
- os.environ["TOKENIZERS_PARALLELISM"] = "false"
  import seaborn as sns
  import preprocessor, helper
- import calendar

  # Theme customization
- st.set_page_config(page_title="WhatsApp Chat Analyzer", layout="wide")
  st.markdown(
      """
      <style>
@@ -20,275 +19,420 @@ st.markdown(
      unsafe_allow_html=True
  )

  st.title("πŸ“Š WhatsApp Chat Sentiment Analysis Dashboard")
  st.subheader('Instructions')
- st.markdown("1. Open the side bar and upload your WhatsApp chat file in .txt format.")
- st.markdown("2. Wait for it to load")
- st.markdown("3. Once the data is loaded, you can customize the analysis by selecting specific users or filtering the data.")
- st.markdown("4. Click on the 'Show Analysis' button to update the analysis with your selected filters.")

  st.sidebar.title("Whatsapp Chat Analyzer")

- # OpenRouter API Key is now handled via .env and openrouter_chat.py
- # No need to request HF token from user
-

- uploaded_file = st.sidebar.file_uploader("Upload your chat file (.txt)", type="txt")
  if uploaded_file is not None:
      raw_data = uploaded_file.read().decode("utf-8")
-
-     # Step 1: Fast Parsing (Lazy Loading)
-     @st.cache_data
-     def load_parsed_data(data):
-         return preprocessor.parse_data(data)
-
-     df = load_parsed_data(raw_data)

-     # Sidebar filters
      st.sidebar.header("πŸ” Filters")
-     user_list = df['user'].unique().tolist()
-     if 'group_notification' in user_list:
-         user_list.remove('group_notification')
-     user_list.sort()
-     user_list.insert(0, "Overall")

-     selected_user = st.sidebar.selectbox("Show analysis wrt", user_list)

      if st.sidebar.button("Show Analysis"):
-         # Basic Stats (Instant)
-         num_messages, words, num_media_messages, num_links = helper.fetch_stats(selected_user, df)
-         st.title("Top Statistics")
-         col1, col2, col3, col4 = st.columns(4)
-
-         with col1:
-             st.header("Total Messages")
-             st.title(num_messages)
-         with col2:
-             st.header("Total Words")
-             st.title(words)
-         with col3:
-             st.header("Media Shared")
-             st.title(num_media_messages)
-         with col4:
-             st.header("Links Shared")
-             st.title(num_links)
-
-         # Monthly Timeline
-         st.title("Monthly Timeline")
-         timeline = helper.monthly_timeline(selected_user, df)
-         fig, ax = plt.subplots()
-         ax.plot(timeline['time'], timeline['message'], color='green')
-         plt.xticks(rotation='vertical')
-         st.pyplot(fig)
-
-         # Daily Timeline
-         st.title("Daily Timeline")
-         daily_timeline = helper.daily_timeline(selected_user, df)
-         fig, ax = plt.subplots()
-         ax.plot(daily_timeline['date'], daily_timeline['message'], color='black')
-         plt.xticks(rotation='vertical')
-         st.pyplot(fig)
-
-         # Activity Map
-         st.title('Activity Map')
-         col1, col2 = st.columns(2)
-
-         with col1:
-             st.header("Most busy day")
-             busy_day = helper.week_activity_map(selected_user, df)
-             fig, ax = plt.subplots()
-             ax.bar(busy_day.index, busy_day.values, color='purple')
-             plt.xticks(rotation='vertical')
-             st.pyplot(fig)
-
-         with col2:
-             st.header("Most busy month")
-             busy_month = helper.month_activity_map(selected_user, df)
-             fig, ax = plt.subplots()
-             ax.bar(busy_month.index, busy_month.values, color='orange')
-             plt.xticks(rotation='vertical')
-             st.pyplot(fig)
-
-         # st.title("Weekly Activity Map")
-         # user_heatmap = helper.activity_heatmap(selected_user, df)
-         # fig, ax = plt.subplots()
-         # ax = sns.heatmap(user_heatmap)
-         # st.pyplot(fig)
-
-         # Most Busy Users
-         if selected_user == 'Overall':
-             st.title('Most Busy Users')
-             x, new_df = helper.most_busy_users(df)
-             fig, ax = plt.subplots()
-
-             col1, col2 = st.columns(2)
-
-             with col1:
-                 ax.bar(x.index, x.values, color='red')
-                 plt.xticks(rotation='vertical')
-                 st.pyplot(fig)
-             with col2:
-                 st.dataframe(new_df)
-
-         # WordCloud
-         st.title("Wordcloud")
-         df_wc = helper.create_wordcloud(selected_user, df)
-         fig, ax = plt.subplots()
-         ax.imshow(df_wc)
-         st.pyplot(fig)
-
-         # Most Common Words
-         st.title('Most Common Words')
-         most_common_df = helper.most_common_words(selected_user, df)
-
-         # Filter emojis to prevent matplotlib warnings
-         most_common_df[0] = most_common_df[0].apply(helper.remove_emojis)
-
-         fig, ax = plt.subplots()
-         ax.barh(most_common_df[0], most_common_df[1])
-         plt.xticks(rotation='vertical')
-         st.pyplot(fig)
-
-         # Emoji Analysis
-         st.title("Emoji Analysis")
-         emoji_df = helper.emoji_helper(selected_user, df)
-         col1, col2 = st.columns(2)
-
-         with col1:
-             st.dataframe(emoji_df)
-         with col2:
-             if not emoji_df.empty:
-                 fig, ax = plt.subplots()
-                 ax.pie(emoji_df[1].head(), labels=emoji_df[0].head(), autopct="%0.2f")
-                 st.pyplot(fig)
-             else:
-                 st.write("No emojis found")
-
-         # --- Deep Analysis Section (Lazy Loaded) ---
-         st.markdown("---")
-         st.header("πŸ€– Deep AI Analysis")
-         st.info("Analyzing Sentiment and Topics... (This may take a few seconds for large files)")
-
-         with st.spinner("Running AI models..."):
-             # Filter df based on selected user first
-             if selected_user != 'Overall':
-                 df_to_analyze = df[df['user'] == selected_user]
-             else:
-                 df_to_analyze = df
-
-             # Check if enough data
-             if len(df_to_analyze) < 10:
-                 st.warning("Not enough data for deep analysis.")
              else:
-                 # Run Analysis
-                 @st.cache_data
-                 def run_deep_analysis(data_frame):
-                     return preprocessor.analyze_sentiment_and_topics(data_frame)
-
-                 analyzed_df, topics = run_deep_analysis(df_to_analyze)
-
-                 # Sentiment Analysis Visualization
-                 st.title("Sentiment Analysis")
-                 sentiment_counts = analyzed_df['sentiment'].value_counts()
-
-                 col1, col2 = st.columns(2)
-                 with col1:
-                     st.dataframe(sentiment_counts)
-                 with col2:
-                     fig, ax = plt.subplots()
-                     ax.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=90, colors=['#66b3ff','#99ff99','#ffcc99'])
-                     ax.axis('equal')
-                     st.pyplot(fig)

-                 # Sentiment by Month
-                 st.write("### Sentiment Count by Month")
-                 # Convert month names to abbreviated format
-                 month_map = {
-                     'January': 'Jan', 'February': 'Feb', 'March': 'Mar', 'April': 'Apr',
-                     'May': 'May', 'June': 'Jun', 'July': 'Jul', 'August': 'Aug',
-                     'September': 'Sep', 'October': 'Oct', 'November': 'Nov', 'December': 'Dec'
-                 }
-                 analyzed_df['month'] = analyzed_df['month'].map(month_map)
-                 monthly_sentiment = analyzed_df.groupby(['month', 'sentiment']).size().unstack(fill_value=0)
-
-                 fig, axes = plt.subplots(1, 3, figsize=(18, 5))
-                 # Plot Positive Sentiment
-                 if 'positive' in monthly_sentiment.columns:
-                     axes[0].bar(monthly_sentiment.index, monthly_sentiment['positive'], color='green')
-                     axes[0].set_title('Positive Sentiment')
-
-                 # Plot Neutral Sentiment
-                 if 'neutral' in monthly_sentiment.columns:
-                     axes[1].bar(monthly_sentiment.index, monthly_sentiment['neutral'], color='blue')
-                     axes[1].set_title('Neutral Sentiment')
-
-                 # Plot Negative Sentiment
-                 if 'negative' in monthly_sentiment.columns:
-                     axes[2].bar(monthly_sentiment.index, monthly_sentiment['negative'], color='red')
-                     axes[2].set_title('Negative Sentiment')
-
-                 st.pyplot(fig)
-
-                 # Topic Analysis Visualization
-                 if len(topics) > 0:
-                     st.title("Topic Analysis")
-                     fig = helper.plot_topics(topics)
-                     st.pyplot(fig)

-                     # Prepare data for new title generation
-                     topic_messages_map = {}
-                     for topic_id in analyzed_df['topic'].unique():
-                         # Get all messages for this topic
-                         msgs = analyzed_df[analyzed_df['topic'] == topic_id]['message'].tolist()
-                         # Select a sample (e.g., top 10 or random 10) to send to AI
-                         # Here we take the first 10 for simplicity, or random could be better
-                         topic_messages_map[topic_id] = msgs[:10]
-
-                     # Generate titles from messages
-                     with st.spinner("Generating descriptive topic titles..."):
-                         topic_titles_map = helper.generate_topic_titles_from_messages(topic_messages_map)

-                     # Convert map to list for plotting compatibility (or update plotting to take dict)
-                     # plot_topics expects a list corresponding to the topics list order
-                     # The topics list from LDA is ordered by topic index 0, 1, 2...
-                     custom_titles = [topic_titles_map.get(i, f"Topic {i}") for i in range(len(topics))]

-                     st.title("Topic Analysis")
-                     fig = helper.plot_topics(topics, custom_titles=custom_titles)
                      st.pyplot(fig)

                      # Display Sample Messages for Each Topic
                      st.header("Sample Messages for Each Topic")

-                     for idx, topic_id in enumerate(analyzed_df['topic'].unique()):
-                         title = topic_titles_map.get(topic_id, f"Topic {topic_id}")
-                         st.subheader(title)
-                         filtered_messages = analyzed_df[analyzed_df['topic'] == topic_id]['message']
-                         for msg in filtered_messages.head(5):
-                             st.text(f"- {msg}")
-                 else:
-                     st.warning("No topics found for visualization.")
-
-                 # Clustering Analysis
-                 st.title("Clustering Analysis")
-                 n_clusters = st.slider("Select Number of Clusters", min_value=2, max_value=10, value=5)
-
-                 # Perform clustering on analyzed_df (which has lemmatized_message)
-                 clustered_df, reduced_features, _ = preprocessor.preprocess_for_clustering(analyzed_df, n_clusters=n_clusters)
-
-                 st.header("Cluster Visualization")
-                 fig = helper.plot_clusters(reduced_features, clustered_df['cluster'])
-                 st.pyplot(fig)
-
-                 # Cluster Insights
-                 st.header("Insights from Clusters")
-
-                 st.subheader("1. Dominant Conversation Themes")
-                 cluster_labels = helper.get_cluster_labels(clustered_df, n_clusters)
-                 for cluster_id, label in cluster_labels.items():
-                     st.write(f"**Cluster {cluster_id}**: {label}")
-
-                 st.subheader("2. Actionable Recommendations")
-                 recommendations = helper.generate_recommendations(clustered_df)
-                 for recommendation in recommendations:
-                     st.write(f"- {recommendation}")
1
  import streamlit as st
2
+ st.set_page_config(page_title="WhatsApp Chat Analyzer", layout="wide")
3
+
4
  import pandas as pd
5
  import matplotlib.pyplot as plt
 
 
 
 
6
  import seaborn as sns
7
  import preprocessor, helper
8
+ from sentiment import predict_sentiment_batch
9
+ import os
10
+ os.environ["STREAMLIT_SERVER_RUN_ON_SAVE"] = "false"
11
 
12
  # Theme customization
 
13
  st.markdown(
14
  """
15
  <style>
 
19
  unsafe_allow_html=True
20
  )
21
 
22
+ # Set seaborn style
23
+ sns.set_theme(style="whitegrid")
24
+
25
  st.title("πŸ“Š WhatsApp Chat Sentiment Analysis Dashboard")
26
  st.subheader('Instructions')
27
+ st.markdown("1. Open the sidebar and upload your WhatsApp chat file in .txt format.")
28
+ st.markdown("2. Wait for the initial processing (minimal delay).")
29
+ st.markdown("3. Customize the analysis by selecting users or filters.")
30
+ st.markdown("4. Click 'Show Analysis' for detailed results.")
31
 
32
  st.sidebar.title("Whatsapp Chat Analyzer")
33
+ uploaded_file = st.sidebar.file_uploader("Upload your chat file (.txt)", type="txt")
34
 
35
+ @st.cache_data
36
+ def load_and_preprocess(file_content):
37
+ return preprocessor.preprocess(file_content)
38
 
 
39
  if uploaded_file is not None:
40
  raw_data = uploaded_file.read().decode("utf-8")
41
+ with st.spinner("Loading chat data..."):
42
+ df, _ = load_and_preprocess(raw_data)
43
+ st.session_state.df = df
 
 
 
 
44
 
 
45
  st.sidebar.header("πŸ” Filters")
46
+ user_list = ["Overall"] + sorted(df["user"].unique().tolist())
47
+ selected_user = st.sidebar.selectbox("Select User", user_list)
 
 
 
48
 
49
+ df_filtered = df if selected_user == "Overall" else df[df["user"] == selected_user]
50
 
51
  if st.sidebar.button("Show Analysis"):
52
+ if df_filtered.empty:
53
+ st.warning(f"No data found for user: {selected_user}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
54
  else:
55
+ with st.spinner("Analyzing..."):
56
+ if 'sentiment' not in df_filtered.columns:
57
+ try:
58
+ print("Starting sentiment analysis...")
59
+ # Get messages as clean strings
60
+ message_list = df_filtered["message"].astype(str).tolist()
61
+ message_list = [msg for msg in message_list if msg.strip()]
62
+
63
+ print(f"Processing {len(message_list)} messages")
64
+ print(f"Sample messages: {message_list[:5]}")
65
+
66
+ # Directly call the sentiment analysis function
67
+ df_filtered['sentiment'] = predict_sentiment_batch(message_list)
68
+ print("Sentiment analysis completed successfully")
69
+
70
+ except Exception as e:
71
+ st.error(f"Sentiment analysis failed: {str(e)}")
72
+ print(f"Full error: {str(e)}")
73
+
74
+ st.session_state.df_filtered = df_filtered
75
+ else:
76
+ st.session_state.df_filtered = df_filtered
77
 
78
+ # Display statistics and visualizations
79
+ num_messages, words, num_media, num_links = helper.fetch_stats(selected_user, df_filtered)
80
+ st.title("Top Statistics")
81
+ col1, col2, col3, col4 = st.columns(4)
82
+ with col1:
83
+ st.header("Total Messages")
84
+ st.title(num_messages)
85
+ with col2:
86
+ st.header("Total Words")
87
+ st.title(words)
88
+ with col3:
89
+ st.header("Media Shared")
90
+ st.title(num_media)
91
+ with col4:
92
+ st.header("Links Shared")
93
+ st.title(num_links)
94
+
95
+ st.title("Monthly Timeline")
96
+ timeline = helper.monthly_timeline(selected_user, df_filtered.sample(min(5000, len(df_filtered))))
97
+ if not timeline.empty:
98
+ plt.figure(figsize=(10, 5))
99
+ sns.lineplot(data=timeline, x='time', y='message', color='green')
100
+ plt.title("Monthly Timeline")
101
+ plt.xlabel("Date")
102
+ plt.ylabel("Messages")
103
+ st.pyplot(plt)
104
+ plt.clf()
105
+
106
+ st.title("Daily Timeline")
107
+ daily_timeline = helper.daily_timeline(selected_user, df_filtered.sample(min(5000, len(df_filtered))))
108
+ if not daily_timeline.empty:
109
+ plt.figure(figsize=(10, 5))
110
+ sns.lineplot(data=daily_timeline, x='date', y='message', color='black')
111
+ plt.title("Daily Timeline")
112
+ plt.xlabel("Date")
113
+ plt.ylabel("Messages")
114
+ st.pyplot(plt)
115
+ plt.clf()
116
+
117
+ st.title("Activity Map")
118
+ col1, col2 = st.columns(2)
119
+ with col1:
120
+ st.header("Most Busy Day")
121
+ busy_day = helper.week_activity_map(selected_user, df_filtered)
122
+ if not busy_day.empty:
123
+ plt.figure(figsize=(10, 5))
124
+ sns.barplot(x=busy_day.index, y=busy_day.values, palette="Purples_r")
125
+ plt.title("Most Busy Day")
126
+ plt.xlabel("Day of Week")
127
+ plt.ylabel("Message Count")
128
+ st.pyplot(plt)
129
+ plt.clf()
130
+ with col2:
131
+ st.header("Most Busy Month")
132
+ busy_month = helper.month_activity_map(selected_user, df_filtered)
133
+ if not busy_month.empty:
134
+ plt.figure(figsize=(10, 5))
135
+ sns.barplot(x=busy_month.index, y=busy_month.values, palette="Oranges_r")
136
+ plt.title("Most Busy Month")
137
+ plt.xlabel("Month")
138
+ plt.ylabel("Message Count")
139
+ st.pyplot(plt)
140
+ plt.clf()
141
+
142
+ if selected_user == 'Overall':
143
+ st.title("Most Busy Users")
144
+ x, new_df = helper.most_busy_users(df_filtered)
145
+ if not x.empty:
146
+ plt.figure(figsize=(10, 5))
147
+ sns.barplot(x=x.index, y=x.values, palette="Reds_r")
148
+ plt.title("Most Busy Users")
149
+ plt.xlabel("User")
150
+ plt.ylabel("Message Count")
151
+ plt.xticks(rotation=45)
152
+ st.pyplot(plt)
153
+ st.title("Word Count by User")
154
+ plt.clf()
155
+ st.dataframe(new_df)
156
 
157
+ # Most common words analysis
158
+ st.title("Most Common Words")
159
+ most_common_df = helper.most_common_words(selected_user, df_filtered)
160
+ if not most_common_df.empty:
161
+ fig, ax = plt.subplots(figsize=(10, 6))
162
+ sns.barplot(y=most_common_df[0], x=most_common_df[1], ax=ax, palette="Blues_r")
163
+ ax.set_title("Top 20 Most Common Words")
164
+ ax.set_xlabel("Frequency")
165
+ ax.set_ylabel("Words")
166
+ plt.xticks(rotation='vertical')
167
+ st.pyplot(fig)
168
+ plt.clf()
169
+ else:
170
+ st.warning("No data available for most common words.")
171
+
172
+ # Emoji analysis
173
+ st.title("Emoji Analysis")
174
+ emoji_df = helper.emoji_helper(selected_user, df_filtered)
175
+ if not emoji_df.empty:
176
+ col1, col2 = st.columns(2)
177
+
178
+ with col1:
179
+ st.subheader("Top Emojis Used")
180
+ st.dataframe(emoji_df)
181
+
182
+ with col2:
183
+ fig, ax = plt.subplots(figsize=(8, 8))
184
+ ax.pie(emoji_df[1].head(), labels=emoji_df[0].head(),
185
+ autopct="%0.2f%%", startangle=90,
186
+ colors=sns.color_palette("pastel"))
187
+ ax.set_title("Top Emoji Distribution")
188
+ st.pyplot(fig)
189
+ plt.clf()
190
+ else:
191
+ st.warning("No data available for emoji analysis.")
192
 
193
+ # Sentiment Analysis Visualizations
194
+ st.title("πŸ“ˆ Sentiment Analysis")
 
 
195
 
196
+ # Convert month names to abbreviated format
197
+ month_map = {
198
+ 'January': 'Jan', 'February': 'Feb', 'March': 'Mar', 'April': 'Apr',
199
+ 'May': 'May', 'June': 'Jun', 'July': 'Jul', 'August': 'Aug',
200
+ 'September': 'Sep', 'October': 'Oct', 'November': 'Nov', 'December': 'Dec'
201
+ }
202
+ df_filtered['month'] = df_filtered['month'].map(month_map)
203
+
204
+ # Group by month and sentiment
205
+ monthly_sentiment = df_filtered.groupby(['month', 'sentiment']).size().unstack(fill_value=0)
206
+
207
+ # Plotting: Histogram (Bar Chart) for each sentiment
208
+ st.write("### Sentiment Count by Month (Histogram)")
209
+
210
+ # Create a figure with subplots for each sentiment
211
+ fig, axes = plt.subplots(1, 3, figsize=(18, 5))
212
+
213
+ # Plot Positive Sentiment
214
+ if 'positive' in monthly_sentiment:
215
+ axes[0].bar(monthly_sentiment.index, monthly_sentiment['positive'], color='green')
216
+ axes[0].set_title('Positive Sentiment')
217
+ axes[0].set_xlabel('Month')
218
+ axes[0].set_ylabel('Count')
219
+
220
+ # Plot Neutral Sentiment
221
+ if 'neutral' in monthly_sentiment:
222
+ axes[1].bar(monthly_sentiment.index, monthly_sentiment['neutral'], color='blue')
223
+ axes[1].set_title('Neutral Sentiment')
224
+ axes[1].set_xlabel('Month')
225
+ axes[1].set_ylabel('Count')
226
+
227
+ # Plot Negative Sentiment
228
+ if 'negative' in monthly_sentiment:
229
+ axes[2].bar(monthly_sentiment.index, monthly_sentiment['negative'], color='red')
230
+ axes[2].set_title('Negative Sentiment')
231
+ axes[2].set_xlabel('Month')
232
+ axes[2].set_ylabel('Count')
233
+
234
+ # Display the plots in Streamlit
235
  st.pyplot(fig)
236
+ plt.clf()
237
+
238
+ # Count sentiments per day of the week
239
+ sentiment_counts = df_filtered.groupby(['day_of_week', 'sentiment']).size().unstack(fill_value=0)
240
+
241
+ # Sort days correctly
242
+ day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
243
+ sentiment_counts = sentiment_counts.reindex(day_order)
244
+
245
+ # Daily Sentiment Analysis
246
+ st.write("### Daily Sentiment Analysis")
247
+
248
+ # Create a Matplotlib figure
249
+ fig, ax = plt.subplots(figsize=(10, 5))
250
+ sentiment_counts.plot(kind='bar', stacked=False, ax=ax, color=['red', 'blue', 'green'])
251
+
252
+ # Customize the plot
253
+ ax.set_xlabel("Day of the Week")
254
+ ax.set_ylabel("Count")
255
+ ax.set_title("Sentiment Distribution per Day of the Week")
256
+ ax.legend(title="Sentiment")
257
+
258
+ # Display the plot in Streamlit
259
+ st.pyplot(fig)
260
+ plt.clf()
261
+
262
+ # Count messages per user per sentiment (only for Overall view)
263
+ if selected_user == 'Overall':
264
+ sentiment_counts = df_filtered.groupby(['user', 'sentiment']).size().reset_index(name='Count')
265
+
266
+ # Calculate total messages per sentiment
267
+ total_per_sentiment = df_filtered['sentiment'].value_counts().to_dict()
268
+
269
+ # Add percentage column
270
+ sentiment_counts['Percentage'] = sentiment_counts.apply(
271
+ lambda row: (row['Count'] / total_per_sentiment[row['sentiment']]) * 100, axis=1
272
+ )
273
+
274
+ # Separate tables for each sentiment
275
+ positive_df = sentiment_counts[sentiment_counts['sentiment'] == 'positive'].sort_values(by='Count', ascending=False).head(10)
276
+ neutral_df = sentiment_counts[sentiment_counts['sentiment'] == 'neutral'].sort_values(by='Count', ascending=False).head(10)
277
+ negative_df = sentiment_counts[sentiment_counts['sentiment'] == 'negative'].sort_values(by='Count', ascending=False).head(10)
278
+
279
+ # Sentiment Contribution Analysis
280
+ st.write("### Sentiment Contribution by User")
281
+
282
+ # Create three columns for side-by-side display
283
+ col1, col2, col3 = st.columns(3)
284
+
285
+ # Display Positive Table
286
+ with col1:
287
+ st.subheader("Top Positive Contributors")
288
+ if not positive_df.empty:
289
+ st.dataframe(positive_df[['user', 'Count', 'Percentage']])
290
+ else:
291
+ st.warning("No positive sentiment data")
292
+
293
+ # Display Neutral Table
294
+ with col2:
295
+ st.subheader("Top Neutral Contributors")
296
+ if not neutral_df.empty:
297
+ st.dataframe(neutral_df[['user', 'Count', 'Percentage']])
298
+ else:
299
+ st.warning("No neutral sentiment data")
300
+
301
+ # Display Negative Table
302
+ with col3:
303
+ st.subheader("Top Negative Contributors")
304
+ if not negative_df.empty:
305
+ st.dataframe(negative_df[['user', 'Count', 'Percentage']])
306
+ else:
307
+ st.warning("No negative sentiment data")
308
+
309
+ # Topic Analysis Section
310
+ st.title("πŸ” Area of Focus: Topic Analysis")
311
+
312
+ # Check if topic column exists, otherwise perform topic modeling
313
+ # if 'topic' not in df_filtered.columns:
314
+ # with st.spinner("Performing topic modeling..."):
315
+ # try:
316
+ # # Add topic modeling here or ensure your helper functions handle it
317
+ # df_filtered = helper.perform_topic_modeling(df_filtered)
318
+ # except Exception as e:
319
+ # st.error(f"Topic modeling failed: {str(e)}")
320
+ # st.stop()
321
 
322
+ # Plot Topic Distribution
323
+ st.header("Topic Distribution")
324
+ try:
325
+ fig = helper.plot_topic_distribution(df_filtered)
326
+ st.pyplot(fig)
327
+ plt.clf()
328
+ except Exception as e:
329
+ st.warning(f"Could not display topic distribution: {str(e)}")
330
+
331
  # Display Sample Messages for Each Topic
332
  st.header("Sample Messages for Each Topic")
333
+ if 'topic' in df_filtered.columns:
334
+ for topic_id in sorted(df_filtered['topic'].unique()):
335
+ st.subheader(f"Topic {topic_id}")
336
+
337
+ # Get messages for the current topic
338
+ filtered_messages = df_filtered[df_filtered['topic'] == topic_id]['message']
339
+
340
+ # Determine sample size
341
+ sample_size = min(5, len(filtered_messages))
342
+
343
+ if sample_size > 0:
344
+ sample_messages = filtered_messages.sample(sample_size, replace=False).tolist()
345
+ for msg in sample_messages:
346
+ st.write(f"- {msg}")
347
+ else:
348
+ st.write("No messages available for this topic.")
349
+ else:
350
+ st.warning("Topic information not available")
351
+
352
+ # Topic Distribution Over Time
353
+ st.header("πŸ“… Topic Trends Over Time")
354
+
355
+ # Add time frequency selector
356
+ time_freq = st.selectbox("Select Time Frequency", ["Daily", "Weekly", "Monthly"], key='time_freq')
357
+
358
+ # Plot topic trends
359
+ try:
360
+ freq_map = {"Daily": "D", "Weekly": "W", "Monthly": "M"}
361
+ topic_distribution = helper.topic_distribution_over_time(df_filtered, time_freq=freq_map[time_freq])
362
+
363
+ # Choose between static and interactive plot
364
+ use_plotly = st.checkbox("Use interactive visualization", value=True, key='use_plotly')
365
+
366
+ if use_plotly:
367
+ fig = helper.plot_topic_distribution_over_time_plotly(topic_distribution)
368
+ st.plotly_chart(fig, use_container_width=True)
369
+ else:
370
+ fig = helper.plot_topic_distribution_over_time(topic_distribution)
371
+ st.pyplot(fig)
372
+ plt.clf()
373
+ except Exception as e:
374
+ st.warning(f"Could not display topic trends: {str(e)}")
375
+
376
+ # Clustering Analysis Section
377
+ st.title("🧩 Conversation Clusters")
378
+
379
+ # Number of clusters input
380
+ n_clusters = st.slider("Select number of clusters",
381
+ min_value=2,
382
+ max_value=10,
383
+ value=5,
384
+ key='n_clusters')
385
 
386
+ # Perform clustering
387
+ with st.spinner("Analyzing conversation clusters..."):
388
+ try:
389
+ df_clustered, reduced_features, _ = preprocessor.preprocess_for_clustering(df_filtered, n_clusters=n_clusters)
390
+
391
+ # Plot clusters
392
+ st.header("Cluster Visualization")
393
+ fig = helper.plot_clusters(reduced_features, df_clustered['cluster'])
394
+ st.pyplot(fig)
395
+ plt.clf()
396
+
397
+ # Cluster Insights
398
+ st.header("📌 Cluster Insights")
399
+
400
+ # 1. Dominant Conversation Themes
401
+ st.subheader("1. Dominant Themes")
402
+ cluster_labels = helper.get_cluster_labels(df_clustered, n_clusters)
403
+ for cluster_id, label in cluster_labels.items():
404
+ st.write(f"**Cluster {cluster_id}**: {label}")
405
+
406
+ # 2. Temporal Patterns
407
+ st.subheader("2. Temporal Patterns")
408
+ temporal_trends = helper.get_temporal_trends(df_clustered)
409
+ for cluster_id, trend in temporal_trends.items():
410
+ st.write(f"**Cluster {cluster_id}**: Peaks on {trend['peak_day']} around {trend['peak_time']}")
411
+
412
+ # 3. User Contributions
413
+ if selected_user == 'Overall':
414
+ st.subheader("3. Top Contributors")
415
+ user_contributions = helper.get_user_contributions(df_clustered)
416
+ for cluster_id, users in user_contributions.items():
417
+ st.write(f"**Cluster {cluster_id}**: {', '.join(users[:3])}...")
418
+
419
+ # 4. Sentiment by Cluster
420
+ st.subheader("4. Sentiment Analysis")
421
+ sentiment_by_cluster = helper.get_sentiment_by_cluster(df_clustered)
422
+ for cluster_id, sentiment in sentiment_by_cluster.items():
423
+ st.write(f"**Cluster {cluster_id}**: {sentiment['positive']}% positive, {sentiment['neutral']}% neutral, {sentiment['negative']}% negative")
424
+
425
+ # Sample messages from each cluster
426
+ st.subheader("Sample Messages")
427
+ for cluster_id in sorted(df_clustered['cluster'].unique()):
428
+ with st.expander(f"Cluster {cluster_id} Messages"):
429
+ cluster_msgs = df_clustered[df_clustered['cluster'] == cluster_id]['message']
430
+ sample_size = min(3, len(cluster_msgs))
431
+ if sample_size > 0:
432
+ for msg in cluster_msgs.sample(sample_size, replace=False):
433
+ st.write(f"- {msg}")
434
+ else:
435
+ st.write("No messages available")
436
+
437
+ except Exception as e:
438
+ st.error(f"Clustering failed: {str(e)}")
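The sample-size guard used in the topic and cluster panels above (capping at the number of available messages before calling `Series.sample`) can be sketched in isolation. `sample_messages` is a hypothetical helper for illustration, not a function from app.py:

```python
import pandas as pd

def sample_messages(messages: pd.Series, k: int = 5) -> list:
    """Return up to k messages without replacement; empty list when none exist."""
    n = min(k, len(messages))
    if n == 0:
        return []
    return messages.sample(n, replace=False, random_state=0).tolist()

print(sample_messages(pd.Series(["hi", "hello", "hey"])))  # three messages, shuffled
print(sample_messages(pd.Series([], dtype=str)))           # []
```

Without the `min(...)` cap, `Series.sample(5)` raises a `ValueError` whenever fewer than five messages match the selected topic or cluster.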
debug_regex.py DELETED
@@ -1,23 +0,0 @@
1
- import pandas as pd
2
- import re
3
-
4
- pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"
5
-
6
- lines = [
7
- "12/12/23, 10:00 - User1: Hello",
8
- "1/1/23, 1:00 - User2: Hi",
9
- "10/10/2023, 10:00 PM - User3: Test",
10
- "12/12/23, 10:00 - System Message"
11
- ]
12
-
13
- df = pd.DataFrame({'line': lines})
14
- extracted = df['line'].str.extract(pattern)
15
- print("Extracted DataFrame:")
16
- print(extracted)
17
-
18
- print("\nRegex Match Check:")
19
- for line in lines:
20
- match = re.match(pattern, line)
21
- print(f"'{line}' -> Match: {bool(match)}")
22
- if match:
23
- print(match.groupdict())
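The deleted script checks the pattern with `re.match`, while the vectorized parser in preprocessor.py relies on `Series.str.extract` with the same pattern. The two paths agree on the named groups (pattern copied verbatim from the file above):

```python
import re
import pandas as pd

pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"

line = "12/12/23, 10:00 - User1: Hello"
row = pd.Series([line]).str.extract(pattern).iloc[0]
match = re.match(pattern, line)

# str.extract builds one column per named group, so both paths agree.
assert match is not None
assert row.to_dict() == match.groupdict()
print(row.to_dict())
```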
finetune.py DELETED
@@ -1,102 +0,0 @@
1
- # Ensure you've run: pip install transformers datasets torch numpy tf-keras
2
- # PyTorch should already be installed (2.4.0 CPU version is fine)
3
-
4
- import torch
5
- from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, TextClassificationPipeline
6
- from datasets import load_dataset
7
- import numpy as np
8
-
9
- # Check device: Use MPS if available (Apple Silicon), else CPU
10
- device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
11
- print(f"Using device: {device}")
12
-
13
- # Step 1: Load the pre-trained model and tokenizer
14
- model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
15
- tokenizer = AutoTokenizer.from_pretrained(model_name)
16
- model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
17
-
18
- # Step 2: Load and prepare the tweet_eval sentiment dataset
19
- dataset = load_dataset("tweet_eval", "sentiment")
20
-
21
- # Remap labels: tweet_eval (0=negative, 1=neutral, 2=positive) to our model (0=positive, 1=neutral, 2=negative)
22
- def remap_labels(example):
23
- label_map = {0: 2, 1: 1, 2: 0} # Negative->2, Neutral->1, Positive->0
24
- example["label"] = label_map[example["label"]]
25
- return example
26
-
27
- dataset = dataset.map(remap_labels)
28
-
29
- # Tokenize the dataset
30
- def tokenize_function(examples):
31
- return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
32
-
33
- tokenized_dataset = dataset.map(tokenize_function, batched=True)
34
- tokenized_dataset = tokenized_dataset.remove_columns(["text"])
35
- tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
36
- tokenized_dataset.set_format("torch")
37
-
38
- # Split into train and eval datasets
39
- train_dataset = tokenized_dataset["train"] # ~45,580 examples
40
- eval_dataset = tokenized_dataset["test"] # ~12,000 examples
41
-
42
- # Step 3: Define a function to compute accuracy
43
- def compute_metrics(eval_pred):
44
- logits, labels = eval_pred
45
- predictions = np.argmax(logits, axis=-1)
46
- accuracy = (predictions == labels).mean()
47
- return {"accuracy": accuracy}
48
-
49
- # Step 4: Set up training arguments
50
- training_args = TrainingArguments(
51
- output_dir="./fine-tuned-sentiment-large",
52
- num_train_epochs=3,
53
- per_device_train_batch_size=4, # Reduced for 8GB RAM
54
- per_device_eval_batch_size=4, # Reduced for 8GB RAM
55
- warmup_steps=500,
56
- weight_decay=0.01,
57
- logging_dir="./logs",
58
- logging_steps=100,
59
- eval_strategy="epoch", # Updated from evaluation_strategy
60
- save_strategy="epoch",
61
- learning_rate=2e-5,
62
- fp16=False, # Disabled (not supported on MPS)
63
- # Use MPS acceleration if available
64
- no_cuda=True, # Force no CUDA since M2 doesn't support it
65
- # torch.backends.mps.is_available() check is handled by device selection
66
- )
67
-
68
- # Step 5: Initialize and train the model
69
- trainer = Trainer(
70
- model=model,
71
- args=training_args,
72
- train_dataset=train_dataset,
73
- eval_dataset=eval_dataset,
74
- compute_metrics=compute_metrics,
75
- )
76
-
77
- print("Starting training...")
78
- trainer.train()
79
-
80
- # Step 6: Save the fine-tuned model
81
- model.save_pretrained("./fine-tuned-sentiment-large")
82
- tokenizer.save_pretrained("./fine-tuned-sentiment-large")
83
- print("Model saved to ./fine-tuned-sentiment-large")
84
-
85
- # Step 7: Evaluate the model on the test set
86
- eval_results = trainer.evaluate()
87
- print(f"Evaluation results: {eval_results}")
88
-
89
- # Step 8: Test on your specific examples
90
- classifier = TextClassificationPipeline(
91
- model=AutoModelForSequenceClassification.from_pretrained("./fine-tuned-sentiment-large").to(device),
92
- tokenizer=AutoTokenizer.from_pretrained("./fine-tuned-sentiment-large"),
93
- device=0 if device.type == "mps" else -1, # 0 for MPS, -1 for CPU
94
- return_all_scores=False
95
- )
96
-
97
- texts = ["Great service!", "It's okay, nothing special.", "Terrible experience."]
98
- results = classifier(texts)
99
-
100
- print("\nTesting on custom examples:")
101
- for text, result in zip(texts, results):
102
- print(f"Text: {text} -> Sentiment: {result['label']}")
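The label remapping in the deleted finetune.py is the step most easily inverted by mistake: tweet_eval encodes 0=negative/1=neutral/2=positive, while the student model's head uses 0=positive/1=neutral/2=negative. A standalone sketch of that mapping (plain dicts stand in for `datasets.map` rows):

```python
# tweet_eval labels:  0=negative, 1=neutral, 2=positive
# model head labels:  0=positive, 1=neutral, 2=negative
LABEL_MAP = {0: 2, 1: 1, 2: 0}

def remap_labels(example: dict) -> dict:
    """Rewrite the label in place, mirroring the function passed to dataset.map."""
    example["label"] = LABEL_MAP[example["label"]]
    return example

rows = [{"text": "awful", "label": 0}, {"text": "fine", "label": 1}, {"text": "great", "label": 2}]
print([remap_labels(dict(r))["label"] for r in rows])  # [2, 1, 0]
```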
helper.py CHANGED
@@ -3,243 +3,130 @@ from wordcloud import WordCloud
3
  import pandas as pd
4
  from collections import Counter
5
  import emoji
 
6
  import matplotlib.pyplot as plt
7
  import seaborn as sns
8
- import plotly.express as px
9
- import numpy as np
10
- from sklearn.feature_extraction.text import TfidfVectorizer
11
- from openrouter_chat import generate_title_from_messages
12
 
13
  extract = URLExtract()
14
 
15
- def fetch_stats(selected_user,df):
16
-
17
  if selected_user != 'Overall':
18
  df = df[df['user'] == selected_user]
19
-
20
- # fetch the number of messages
21
  num_messages = df.shape[0]
22
-
23
- # fetch the total number of words
24
- words = []
25
- for message in df['message']:
26
- words.extend(message.split())
27
-
28
- # fetch number of media messages
29
- num_media_messages = df[df['unfiltered_messages'].str.contains('<media omitted>', case=False, na=False)].shape[0]
30
-
31
- # fetch number of links shared
32
- links = []
33
- for message in df['unfiltered_messages']:
34
- links.extend(extract.find_urls(message))
35
-
36
- return num_messages,len(words),num_media_messages,len(links)
37
 
38
  def most_busy_users(df):
39
  x = df['user'].value_counts().head()
40
  df = round((df['user'].value_counts() / df.shape[0]) * 100, 2).reset_index().rename(
41
  columns={'index': 'percentage', 'user': 'Name'})
42
- return x,df
43
 
44
  def create_wordcloud(selected_user, df):
45
- # f = open('stop_hinglish.txt', 'r')
46
- stop_words = df
47
-
48
  if selected_user != 'Overall':
49
  df = df[df['user'] == selected_user]
50
-
51
  temp = df[df['user'] != 'group_notification']
52
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
53
-
54
- def remove_stop_words(message):
55
- y = []
56
- for word in message.lower().split():
57
- if word not in stop_words:
58
- y.append(word)
59
- return " ".join(y)
60
-
61
  wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')
62
- temp['message'] = temp['message'].apply(remove_stop_words)
63
  df_wc = wc.generate(temp['message'].str.cat(sep=" "))
64
  return df_wc
65
 
66
  def most_common_words(selected_user, df):
67
- # f = open('stop_hinglish.txt','r')
68
- stop_words = df
69
-
70
  if selected_user != 'Overall':
71
  df = df[df['user'] == selected_user]
72
-
73
  temp = df[df['user'] != 'group_notification']
74
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
75
-
76
- words = []
77
-
78
- for message in temp['message']:
79
- for word in message.lower().split():
80
- if word not in stop_words:
81
- words.append(word)
82
-
83
- most_common_df = pd.DataFrame(Counter(words).most_common(20))
84
- return most_common_df
85
 
86
  def emoji_helper(selected_user, df):
87
  if selected_user != 'Overall':
88
  df = df[df['user'] == selected_user]
 
 
89
 
90
- emojis = []
91
- for message in df['unfiltered_messages']:
92
- emojis.extend([c for c in message if c in emoji.EMOJI_DATA])
93
-
94
- emoji_df = pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
95
-
96
- if emoji_df.empty:
97
- return pd.DataFrame(columns=[0, 1])
98
-
99
- return emoji_df
100
-
101
-
102
- def monthly_timeline(selected_user,df):
103
-
104
  if selected_user != 'Overall':
105
  df = df[df['user'] == selected_user]
106
-
107
- timeline = df.groupby(['year','month']).count()['message'].reset_index()
108
-
109
- time = []
110
- for i in range(timeline.shape[0]):
111
- time.append(timeline['month'][i] + "-" + str(timeline['year'][i]))
112
-
113
- timeline['time'] = time
114
-
115
  return timeline
116
 
117
- def daily_timeline(selected_user,df):
118
-
119
  if selected_user != 'Overall':
120
  df = df[df['user'] == selected_user]
 
121
 
122
- daily_timeline = df.groupby('date').count()['message'].reset_index()
123
-
124
- return daily_timeline
125
-
126
- def week_activity_map(selected_user,df):
127
-
128
  if selected_user != 'Overall':
129
  df = df[df['user'] == selected_user]
 
130
 
131
- return df['day'].value_counts()
132
-
133
- def month_activity_map(selected_user,df):
134
-
135
  if selected_user != 'Overall':
136
  df = df[df['user'] == selected_user]
137
-
138
  return df['month'].value_counts()
139
 
140
- def activity_heatmap(selected_user,df):
 
 
 
141
 
142
  if selected_user != 'Overall':
143
  df = df[df['user'] == selected_user]
144
 
145
- user_heatmap = df.pivot_table(index='day', columns='period', values='message', aggfunc='count').fillna(0)
 
146
 
147
- return user_heatmap
148
 
149
- def generate_wordcloud(text, color):
150
- wordcloud = WordCloud(width=400, height=300, background_color=color, colormap="viridis").generate(text)
151
- return wordcloud
 
152
 
153
- def create_heuristic_title(topic, idx):
154
- return f"Topic {idx + 1}: {', '.join(topic[:3])}"
155
 
156
- def generate_topic_titles_from_messages(topic_messages_map):
157
- """
158
- Generate titles for topics using OpenRouter AI based on message content.
159
-
160
- Args:
161
- topic_messages_map (dict): key=topic_id, value=list of message strings
162
-
163
- Returns:
164
- dict: key=topic_id, value=generated title
165
- """
166
- titles = {}
167
- print("Generating topic titles using OpenRouter...")
168
- for topic_id, messages in topic_messages_map.items():
169
- try:
170
- # Generate title from sample messages
171
- title = generate_title_from_messages(messages)
172
- titles[topic_id] = title
173
- print(f"Topic {topic_id}: {title}\n\n\n\n{messages}")
174
- except Exception as e:
175
- print(f"Failed to generate title for topic {topic_id}: {e}")
176
- titles[topic_id] = f"Topic {topic_id}"
177
-
178
- return titles
179
-
180
- def create_basic_titles(topics):
181
- """Fallback to keyword-based titles if AI fails or is unused."""
182
- titles = []
183
- for idx, topic_words in enumerate(topics):
184
- if isinstance(topic_words, list) and len(topic_words) >= 3:
185
- title = f"Topic {idx}: {', '.join(topic_words[:3])}"
186
- else:
187
- title = f"Topic {idx}"
188
- titles.append(title)
189
- return titles
190
 
191
- def plot_topics(topics, use_ai=True, **kwargs):
192
- """
193
- Plots a bar chart for the top words in each topic.
194
-
195
- Args:
196
- topics: List of topics (lists of top words)
197
- custom_titles: Optional list or dict of titles to use instead of generating them
198
-
199
- Returns:
200
- matplotlib.figure.Figure: The plot figure
201
- """
202
- if not topics or not isinstance(topics[0], list):
203
- raise ValueError("topics must be a list of lists of words.")
204
-
205
- # Determine titles
206
- custom_titles = kwargs.get('custom_titles')
207
- if custom_titles:
208
- # If it's a dict, convert to list based on index
209
- if isinstance(custom_titles, dict):
210
- titles = [custom_titles.get(i, f"Topic {i}") for i in range(len(topics))]
211
- else:
212
- titles = custom_titles
213
- else:
214
- # Fallback to basic keyword-based titles
215
- titles = create_basic_titles(topics)
216
-
217
- fig, axes = plt.subplots(1, len(topics), figsize=(20, 10))
218
- if len(topics) == 1:
219
- axes = [axes] # Ensure axes is iterable for single topic
220
-
221
- for idx, topic in enumerate(topics):
222
- if not isinstance(topic, list):
223
- raise ValueError(f"Topic {idx} is not a list of words.")
224
-
225
- top_words = topic[:10] # Show top 10 words
226
- axes[idx].barh(range(len(top_words)), range(len(top_words)))
227
- axes[idx].set_yticks(range(len(top_words)))
228
- axes[idx].set_yticklabels(top_words)
229
- axes[idx].set_title(titles[idx], fontsize=14, fontweight='bold')
230
- axes[idx].set_xlabel("Word Importance")
231
- axes[idx].set_ylabel("Top Words")
232
 
233
- plt.tight_layout()
234
- return fig
235
 
 
236
  def plot_topic_distribution(df):
237
  """
238
  Plots the distribution of topics in the chat data.
239
  """
240
  topic_counts = df['topic'].value_counts().sort_index()
241
  fig, ax = plt.subplots()
242
- sns.barplot(x=topic_counts.index, y=topic_counts.values, ax=ax, palette="viridis", hue=topic_counts.index, legend=False)
243
  ax.set_title("Topic Distribution")
244
  ax.set_xlabel("Topic")
245
  ax.set_ylabel("Number of Messages")
@@ -252,16 +139,6 @@ def most_frequent_keywords(messages, top_n=10):
252
  words = [word for msg in messages for word in msg.split()]
253
  word_freq = Counter(words)
254
  return word_freq.most_common(top_n)
255
-
256
- def topic_distribution_over_time(df, time_freq='M'):
257
- """
258
- Analyzes the distribution of topics over time.
259
- """
260
- # Group by time interval and topic
261
- df['time_period'] = df['date'].dt.to_period(time_freq)
262
- topic_distribution = df.groupby(['time_period', 'topic']).size().unstack(fill_value=0)
263
- return topic_distribution
264
-
265
  def plot_topic_distribution_over_time(topic_distribution):
266
  """
267
  Plots the distribution of topics over time using a line chart.
@@ -286,11 +163,37 @@ def plot_most_frequent_keywords(keywords):
286
  """
287
  words, counts = zip(*keywords)
288
  fig, ax = plt.subplots()
289
- sns.barplot(x=list(counts), y=list(words), ax=ax, palette="viridis", hue=list(words), legend=False)
290
  ax.set_title("Most Frequent Keywords")
291
  ax.set_xlabel("Frequency")
292
  ax.set_ylabel("Keyword")
293
  return fig
 
 
294
 
295
  def plot_topic_distribution_over_time_plotly(topic_distribution):
296
  """
@@ -304,7 +207,6 @@ def plot_topic_distribution_over_time_plotly(topic_distribution):
304
  title="Topic Distribution Over Time", labels={'time_period': 'Time Period', 'count': 'Number of Messages'})
305
  fig.update_layout(legend_title_text='Topics', xaxis_tickangle=-45)
306
  return fig
307
-
308
  def plot_clusters(reduced_features, clusters):
309
  """
310
  Visualize clusters using t-SNE.
@@ -327,25 +229,19 @@ def plot_clusters(reduced_features, clusters):
327
  plt.ylabel("t-SNE Component 2")
328
  plt.tight_layout()
329
  return plt.gcf()
330
-
331
- def remove_emojis(text):
332
- """Removes emojis from text to prevent matplotlib warnings."""
333
- return text.encode('ascii', 'ignore').decode('ascii')
334
-
335
  def get_cluster_labels(df, n_clusters):
336
  """
337
  Generate descriptive labels for each cluster based on top keywords.
338
  """
 
 
 
339
  vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
340
  tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
341
 
342
  cluster_labels = {}
343
- # Reset index to ensure alignment with tfidf_matrix
344
- df_reset = df.reset_index(drop=True)
345
-
346
  for cluster_id in range(n_clusters):
347
- # Get indices where cluster matches
348
- cluster_indices = df_reset[df_reset['cluster'] == cluster_id].index
349
  if len(cluster_indices) > 0:
350
  cluster_tfidf = tfidf_matrix[cluster_indices]
351
  top_keywords = np.argsort(cluster_tfidf.sum(axis=0).A1)[-3:][::-1]
 
3
  import pandas as pd
4
  from collections import Counter
5
  import emoji
6
+ import plotly.express as px
7
  import matplotlib.pyplot as plt
8
  import seaborn as sns
 
 
 
 
9
 
10
  extract = URLExtract()
11
 
12
+ def fetch_stats(selected_user, df):
 
13
  if selected_user != 'Overall':
14
  df = df[df['user'] == selected_user]
 
 
15
  num_messages = df.shape[0]
16
+ words = sum(len(msg.split()) for msg in df['message'])
17
+ num_media_messages = df[df['unfiltered_messages'] == '<media omitted>\n'].shape[0]
18
+ links = sum(len(extract.find_urls(msg)) for msg in df['unfiltered_messages'])
19
+ return num_messages, words, num_media_messages, links
 
 
 
 
 
 
 
 
 
 
 
20
 
21
  def most_busy_users(df):
22
  x = df['user'].value_counts().head()
23
  df = round((df['user'].value_counts() / df.shape[0]) * 100, 2).reset_index().rename(
24
  columns={'index': 'percentage', 'user': 'Name'})
25
+ return x, df
26
 
27
  def create_wordcloud(selected_user, df):
 
 
 
28
  if selected_user != 'Overall':
29
  df = df[df['user'] == selected_user]
 
30
  temp = df[df['user'] != 'group_notification']
31
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
 
 
 
 
 
 
 
 
32
  wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')
 
33
  df_wc = wc.generate(temp['message'].str.cat(sep=" "))
34
  return df_wc
35
 
36
  def most_common_words(selected_user, df):
 
 
 
37
  if selected_user != 'Overall':
38
  df = df[df['user'] == selected_user]
 
39
  temp = df[df['user'] != 'group_notification']
40
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
41
+ words = [word for msg in temp['message'] for word in msg.lower().split()]
42
+ return pd.DataFrame(Counter(words).most_common(20))
 
 
 
 
 
 
 
 
43
 
44
  def emoji_helper(selected_user, df):
45
  if selected_user != 'Overall':
46
  df = df[df['user'] == selected_user]
47
+ emojis = [c for msg in df['unfiltered_messages'] for c in msg if c in emoji.EMOJI_DATA]
48
+ return pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
49
 
50
+ def monthly_timeline(selected_user, df):
 
 
 
 
 
 
 
 
 
 
 
 
 
51
  if selected_user != 'Overall':
52
  df = df[df['user'] == selected_user]
53
+ timeline = df.groupby(['year', 'month']).count()['message'].reset_index()
54
+ timeline['time'] = timeline['month'] + "-" + timeline['year'].astype(str)
 
 
 
 
 
 
 
55
  return timeline
56
 
57
+ def daily_timeline(selected_user, df):
 
58
  if selected_user != 'Overall':
59
  df = df[df['user'] == selected_user]
60
+ return df.groupby('date').count()['message'].reset_index()
61
 
62
+ def week_activity_map(selected_user, df):
 
 
 
 
 
63
  if selected_user != 'Overall':
64
  df = df[df['user'] == selected_user]
65
+ return df['day_of_week'].value_counts()
66
 
67
+ def month_activity_map(selected_user, df):
 
 
 
68
  if selected_user != 'Overall':
69
  df = df[df['user'] == selected_user]
 
70
  return df['month'].value_counts()
71
 
72
+ def plot_topic_distribution(df):
73
+ topic_counts = df['topic'].value_counts().sort_index()
74
+ fig = px.bar(x=topic_counts.index, y=topic_counts.values, title="Topic Distribution", color_discrete_sequence=px.colors.sequential.Viridis)
75
+ return fig
76
+
77
+ def topic_distribution_over_time(df, time_freq='M'):
78
+ df['time_period'] = df['date'].dt.to_period(time_freq)
79
+ return df.groupby(['time_period', 'topic']).size().unstack(fill_value=0)
80
+
81
+ def plot_topic_distribution_over_time_plotly(topic_distribution):
82
+ topic_distribution = topic_distribution.reset_index()
83
+ topic_distribution['time_period'] = topic_distribution['time_period'].dt.to_timestamp()
84
+ topic_distribution = topic_distribution.melt(id_vars='time_period', var_name='topic', value_name='count')
85
+ fig = px.line(topic_distribution, x='time_period', y='count', color='topic', title="Topic Distribution Over Time")
86
+ fig.update_layout(legend_title_text='Topics', xaxis_tickangle=-45)
87
+ return fig
88
+
89
+ def plot_clusters(reduced_features, clusters):
90
+ fig = px.scatter(x=reduced_features[:, 0], y=reduced_features[:, 1], color=clusters, title="Message Clusters (t-SNE)")
91
+ return fig
92
+ def most_common_words(selected_user, df):
93
+ # f = open('stop_hinglish.txt','r')
94
+ stop_words = df
95
 
96
  if selected_user != 'Overall':
97
  df = df[df['user'] == selected_user]
98
 
99
+ temp = df[df['user'] != 'group_notification']
100
+ temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
101
 
102
+ words = []
103
 
104
+ for message in temp['message']:
105
+ for word in message.lower().split():
106
+ if word not in stop_words:
107
+ words.append(word)
108
 
109
+ most_common_df = pd.DataFrame(Counter(words).most_common(20))
110
+ return most_common_df
111
 
112
+ def emoji_helper(selected_user, df):
113
+ if selected_user != 'Overall':
114
+ df = df[df['user'] == selected_user]
 
 
115
 
116
+ emojis = []
117
+ for message in df['unfiltered_messages']:
118
+ emojis.extend([c for c in message if c in emoji.EMOJI_DATA])
 
 
119
 
120
+ emoji_df = pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
 
121
 
122
+ return emoji_df
123
  def plot_topic_distribution(df):
124
  """
125
  Plots the distribution of topics in the chat data.
126
  """
127
  topic_counts = df['topic'].value_counts().sort_index()
128
  fig, ax = plt.subplots()
129
+ sns.barplot(x=topic_counts.index, y=topic_counts.values, ax=ax, palette="viridis")
130
  ax.set_title("Topic Distribution")
131
  ax.set_xlabel("Topic")
132
  ax.set_ylabel("Number of Messages")
 
139
  words = [word for msg in messages for word in msg.split()]
140
  word_freq = Counter(words)
141
  return word_freq.most_common(top_n)
 
 
 
 
 
 
 
 
 
 
142
  def plot_topic_distribution_over_time(topic_distribution):
143
  """
144
  Plots the distribution of topics over time using a line chart.
 
163
  """
164
  words, counts = zip(*keywords)
165
  fig, ax = plt.subplots()
166
+ sns.barplot(x=list(counts), y=list(words), ax=ax, palette="viridis")
167
  ax.set_title("Most Frequent Keywords")
168
  ax.set_xlabel("Frequency")
169
  ax.set_ylabel("Keyword")
170
  return fig
171
+ def topic_distribution_over_time(df, time_freq='M'):
172
+ """
173
+ Analyzes the distribution of topics over time.
174
+ """
175
+ # Group by time interval and topic
176
+ df['time_period'] = df['date'].dt.to_period(time_freq)
177
+ topic_distribution = df.groupby(['time_period', 'topic']).size().unstack(fill_value=0)
178
+ return topic_distribution
179
+
180
+ def plot_topic_distribution_over_time(topic_distribution):
181
+ """
182
+ Plots the distribution of topics over time using a line chart.
183
+ """
184
+ fig, ax = plt.subplots(figsize=(12, 6))
185
+
186
+ # Plot each topic as a separate line
187
+ for topic in topic_distribution.columns:
188
+ ax.plot(topic_distribution.index.to_timestamp(), topic_distribution[topic], label=f"Topic {topic}")
189
+
190
+ ax.set_title("Topic Distribution Over Time")
191
+ ax.set_xlabel("Time Period")
192
+ ax.set_ylabel("Number of Messages")
193
+ ax.legend(title="Topics", bbox_to_anchor=(1.05, 1), loc='upper left')
194
+ plt.xticks(rotation=45)
195
+ plt.tight_layout()
196
+ return fig
197
 
198
  def plot_topic_distribution_over_time_plotly(topic_distribution):
199
  """
 
207
  title="Topic Distribution Over Time", labels={'time_period': 'Time Period', 'count': 'Number of Messages'})
208
  fig.update_layout(legend_title_text='Topics', xaxis_tickangle=-45)
209
  return fig
 
210
  def plot_clusters(reduced_features, clusters):
211
  """
212
  Visualize clusters using t-SNE.
 
229
  plt.ylabel("t-SNE Component 2")
230
  plt.tight_layout()
231
  return plt.gcf()
 
 
 
 
 
232
  def get_cluster_labels(df, n_clusters):
233
  """
234
  Generate descriptive labels for each cluster based on top keywords.
235
  """
236
+ from sklearn.feature_extraction.text import TfidfVectorizer
237
+ import numpy as np
238
+
239
  vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
240
  tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
241
 
242
  cluster_labels = {}
 
 
 
243
  for cluster_id in range(n_clusters):
244
+ cluster_indices = df[df['cluster'] == cluster_id].index
 
245
  if len(cluster_indices) > 0:
246
  cluster_tfidf = tfidf_matrix[cluster_indices]
247
  top_keywords = np.argsort(cluster_tfidf.sum(axis=0).A1)[-3:][::-1]
naive_bayes_model.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b4298e7bca558075d89911be8c06311d1276cfc414a445f6a5ed6561201985f1
3
- size 480879
openrouter_chat.py DELETED
@@ -1,91 +0,0 @@
1
- import os
2
- import requests
3
- from dotenv import load_dotenv
4
-
5
- # Load .env file
6
- load_dotenv()
7
-
8
- API_KEY = os.getenv("OPENROUTER_API_KEY")
9
-
10
- if not API_KEY:
11
- raise ValueError("OPENROUTER_API_KEY not found in .env file")
12
-
13
- API_URL = "https://openrouter.ai/api/v1/chat/completions"
14
-
15
- def get_chat_completion(messages, model="openai/gpt-4o-mini"):
16
- """
17
- Sends a list of messages to the OpenRouter API and returns the response content.
18
-
19
- Args:
20
- messages (list): A list of message dictionaries (e.g., [{"role": "user", "content": "..."}]).
21
- model (str): The model to use.
22
-
23
- Returns:
24
- str: The content of the AI's response.
25
- """
26
- headers = {
27
- "Authorization": f"Bearer {API_KEY}",
28
- "Content-Type": "application/json",
29
- "HTTP-Referer": "http://localhost",
30
- "X-Title": "WhatsApp Chat Analyzer"
31
- }
32
-
33
- payload = {
34
- "model": model,
35
- "messages": messages
36
- }
37
-
38
- try:
39
- response = requests.post(API_URL, headers=headers, json=payload)
40
- response.raise_for_status()
41
- data = response.json()
42
- return data["choices"][0]["message"]["content"]
43
- except Exception as e:
44
- print(f"Error calling OpenRouter API: {e}")
45
- return None
46
-
47
- def generate_title_from_messages(messages_list):
48
- """
49
- Generates a short, descriptive topic title based on a list of messages.
50
-
51
- Args:
52
- messages_list (list): A list of strings, where each string is a chat message.
53
-
54
- Returns:
55
- str: A generated title.
56
- """
57
- if not messages_list:
58
- return "Unknown Topic"
59
-
60
- # Limit to reasonable amount of text to avoid context limits or high costs
61
- # Join top messages with newlines
62
- context = "\n".join(messages_list[:10])
63
-
64
- prompt = (
65
- "Analyze the following WhatsApp chat messages and generate a SINGLE, short, descriptive title "
66
- "(max 5 words) that summarizes the conversation topic. Do not use quotes or prefixes like 'Topic:'. "
67
- "Just the title.\n\n"
68
- f"Messages:\n{context}"
69
- )
70
-
71
- messages = [
72
- {"role": "system", "content": "You are a helpful assistant that summarizes chat topics."},
73
- {"role": "user", "content": prompt}
74
- ]
75
-
76
- title = get_chat_completion(messages)
77
- print("Title:\n\n\n\n\n", title)
78
- return title.strip() if title else "General Discussion"
79
-
80
- if __name__ == "__main__":
81
- print("🤖 OpenRouter AI Chat (type 'exit' to quit)\n")
82
-
83
- while True:
84
- user_input = input("You: ")
85
- if user_input.lower() == "exit":
86
- break
87
-
88
- # Test the basic chat function
89
- msgs = [{"role": "user", "content": user_input}]
90
- reply = get_chat_completion(msgs)
91
- print("\nAI:", reply, "\n")
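`generate_title_from_messages` (deleted above) truncates to the first ten messages before prompting, which bounds token cost no matter how large a topic is. The prompt assembly can be separated from the API call; `build_title_prompt` is a hypothetical refactor, not a function from the file:

```python
def build_title_prompt(messages_list, limit=10):
    """Join up to `limit` messages into the title-generation prompt; None if empty."""
    if not messages_list:
        return None
    context = "\n".join(messages_list[:limit])
    return (
        "Analyze the following WhatsApp chat messages and generate a SINGLE, "
        "short, descriptive title (max 5 words) that summarizes the conversation "
        "topic. Do not use quotes or prefixes like 'Topic:'. Just the title.\n\n"
        f"Messages:\n{context}"
    )

prompt = build_title_prompt([str(i) for i in range(20)])
print(prompt.splitlines()[-1])  # "9" -- everything past the tenth message is dropped
```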
preprocessor.py CHANGED
@@ -1,25 +1,73 @@
1
  import re
2
  import pandas as pd
3
- # from sentiment_train import predict_sentiment
4
- from sentiment import predict_sentiment_bert_batch
5
  import spacy
6
- from langdetect import detect, LangDetectException
7
- from sklearn.feature_extraction.text import CountVectorizer
8
  from sklearn.decomposition import LatentDirichletAllocation
9
  from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
10
  from spacy.lang.fr.stop_words import STOP_WORDS as FRENCH_STOP_WORDS
11
- from sklearn.feature_extraction.text import TfidfVectorizer
12
  from sklearn.cluster import KMeans
13
  from sklearn.manifold import TSNE
14
  import numpy as np
 
 
 
 
15
 
16
- # Load language models
17
- nlp_fr = spacy.load("fr_core_news_sm")
18
- nlp_en = spacy.load("en_core_web_sm")
19
 
20
- # Merge English and French stop words
 
 
21
  custom_stop_words = list(ENGLISH_STOP_WORDS.union(FRENCH_STOP_WORDS))
22
 
 
23
  def lemmatize_text(text, lang):
24
  if lang == 'fr':
25
  doc = nlp_fr(text)
@@ -27,166 +75,94 @@ def lemmatize_text(text, lang):
27
  doc = nlp_en(text)
28
  return " ".join([token.lemma_ for token in doc if not token.is_punct])
29
 
30
- def clean_message(text):
31
- """ Remove media notifications, special characters, and unwanted symbols. """
32
- if not isinstance(text, str):
33
- return ""
34
- text = text.lower() # Convert to lowercase
35
- text = re.sub(r"<media omitted>", "", text) # Remove media notifications
36
- text = re.sub(r"this message was deleted", "", text)
37
- text = re.sub(r"null", "", text)
38
-
39
- text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE) # Remove links
40
- text = re.sub(r"[^a-zA-ZΓ€-ΓΏ0-9\s]", "", text) # Remove special characters
41
- return text
42
-
43
- from sklearn.feature_extraction.text import TfidfVectorizer
44
- from sklearn.cluster import KMeans
45
- from sklearn.manifold import TSNE
46
- import numpy as np
47
-
48
- def preprocess_for_clustering(df, n_clusters=5):
49
- """
50
- Preprocess messages for clustering.
51
- Args:
52
- df (pd.DataFrame): DataFrame containing the 'lemmatized_message' column.
53
- n_clusters (int): Number of clusters to create.
54
- Returns:
55
- df (pd.DataFrame): DataFrame with added 'cluster' column.
56
- cluster_centers (np.array): Cluster centroids.
57
- """
58
- # Step 1: Vectorize text using TF-IDF
59
- vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
60
- tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
61
-
62
- # Step 2: Apply K-Means clustering
63
- kmeans = KMeans(n_clusters=n_clusters, random_state=42)
64
- clusters = kmeans.fit_predict(tfidf_matrix)
65
-
66
- # Step 3: Add cluster labels to DataFrame
67
- df['cluster'] = clusters
68
-
69
- # Step 4: Reduce dimensionality for visualization
70
- tsne = TSNE(n_components=2, random_state=42)
71
- reduced_features = tsne.fit_transform(tfidf_matrix.toarray())
72
-
73
- return df, reduced_features, kmeans.cluster_centers_
74
-
75
- def parse_data(data):
76
- """
77
- Parses the raw chat data into a DataFrame and performs basic cleaning.
78
- """
79
- # Optimization: Use pandas vectorized string operations instead of looping
80
-
81
- # Split lines
82
- lines = data.strip().split("\n")
83
- df = pd.DataFrame({'line': lines})
84
-
85
- # Extract Date, Time, Sender, Message using regex
86
  pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"
87
-
88
- extracted = df['line'].str.extract(pattern)
89
-
90
- # Drop lines that didn't match (if any)
91
- extracted = extracted.dropna(subset=['Date', 'Time', 'Message'])
92
-
93
- # Combine Date and Time
94
- extracted['Time'] = extracted['Time'].str.replace('\u202f', ' ', regex=False)
95
- extracted['message_date'] = extracted['Date'] + ", " + extracted['Time']
96
-
97
- # Handle Sender
98
- extracted['Sender'] = extracted['Sender'].fillna('group_notification')
99
-
100
- # Rename columns
101
- df = extracted.rename(columns={'Sender': 'user', 'Message': 'message'})
102
-
103
- # Filter out system messages
104
- df = df[df['user'].str.lower() != 'system']
105
-
106
- # Convert date
107
- df['date'] = pd.to_datetime(df['message_date'], format='%m/%d/%y, %I:%M %p', errors='coerce')
108
-
109
- # Filter out invalid dates
110
- df = df.dropna(subset=['date'])
111
-
112
- # Filter out group notifications
113
- df = df[df["user"] != "group_notification"]
114
- df.reset_index(drop=True, inplace=True)
115
 
116
- # unfiltered messages
 
 
117
  df["unfiltered_messages"] = df["message"]
118
- # Clean messages
119
  df["message"] = df["message"].apply(clean_message)
120
 
121
  # Extract time-based features
122
- df['year'] = df['date'].dt.year
123
  df['month'] = df['date'].dt.month_name()
124
- df['day'] = df['date'].dt.day
125
- df['hour'] = df['date'].dt.hour
126
  df['day_of_week'] = df['date'].dt.day_name()
127
- df['minute'] = df['date'].dt.minute
128
-
129
- period = []
130
- for hour in df['hour']:
131
- if hour == 23:
132
- period.append(str(hour) + "-" + str('00'))
133
- elif hour == 0:
134
- period.append(str('00') + "-" + str(hour + 1))
135
- else:
136
- period.append(str(hour) + "-" + str(hour + 1))
137
-
138
- df['period'] = period
139
-
140
- return df
141
-
142
- def analyze_sentiment_and_topics(df):
143
- """
144
- Performs heavy NLP tasks: Lemmatization, Sentiment Analysis, and Topic Modeling.
145
- Includes sampling for large datasets.
146
- """
147
- # Sampling Logic: Cap at 5000 messages for deep analysis
148
- original_df_len = len(df)
149
- if len(df) > 5000:
150
- print(f"Sampling 5000 messages from {len(df)}...")
151
- # We keep the original index to potentially map back, but for now we just work on the sample
152
- df_sample = df.sample(5000, random_state=42).copy()
153
- else:
154
- df_sample = df.copy()
155
-
156
- # Filter and lemmatize messages
157
- lemmatized_messages = []
158
- # Optimization: Detect dominant language on a sample
159
- sample_size = min(len(df_sample), 500)
160
- sample_text = " ".join(df_sample["message"].sample(sample_size, random_state=42).tolist())
161
- try:
162
- dominant_lang = detect(sample_text)
163
- except LangDetectException:
164
- dominant_lang = 'en'
165
-
166
- nlp = nlp_fr if dominant_lang == 'fr' else nlp_en
167
 
168
- # Use nlp.pipe for batch processing
169
  lemmatized_messages = []
170
- for doc in nlp.pipe(df_sample["message"].tolist(), batch_size=1000, disable=["ner", "parser"]):
171
- lemmatized_messages.append(" ".join([token.lemma_ for token in doc if not token.is_punct]))
172
-
173
- df_sample["lemmatized_message"] = lemmatized_messages
174
-
175
- # Apply sentiment analysis
176
- # Use batch processing for speed
177
- df_sample['sentiment'] = predict_sentiment_bert_batch(df_sample["message"].tolist(), batch_size=128)
178
-
179
- # Filter out rows with null lemmatized_message
180
- df_sample = df_sample.dropna(subset=['lemmatized_message'])
181
-
182
- # **Fix: Use a custom stop word list**
183
  vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words=custom_stop_words)
184
- try:
185
- dtm = vectorizer.fit_transform(df_sample['lemmatized_message'])
186
- except ValueError:
187
- # Handle case where vocabulary is empty (e.g. all stop words)
188
- print("Warning: Empty vocabulary after filtering. Returning empty topics.")
189
- return df_sample, []
190
 
191
  # Apply LDA
192
  lda = LatentDirichletAllocation(n_components=5, random_state=42)
@@ -194,17 +170,63 @@ def analyze_sentiment_and_topics(df):
194
 
195
  # Assign topics to messages
196
  topic_results = lda.transform(dtm)
197
- df_sample = df_sample.iloc[:topic_results.shape[0]].copy()
198
- df_sample['topic'] = topic_results.argmax(axis=1)
199
 
200
  # Store topics for visualization
201
  topics = []
202
  for topic in lda.components_:
203
  topics.append([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-10:]])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
204
 
205
- # If we sampled, we return the sampled dataframe with sentiment/topics.
206
- # The main app will need to handle that 'df' (full) and 'df_analyzed' (sample) might be different.
207
- # Or we can try to merge back? Merging back 5000 sentiments to 40000 messages leaves 35000 nulls.
208
- # For visualization purposes (pie charts, etc), using the sample is usually fine as it's representative.
209
 
210
- return df_sample, topics
1
  import re
2
  import pandas as pd
 
 
3
  import spacy
4
+ from langdetect import detect_langs
5
+ from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
6
  from sklearn.decomposition import LatentDirichletAllocation
7
  from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
8
  from spacy.lang.fr.stop_words import STOP_WORDS as FRENCH_STOP_WORDS
 
9
  from sklearn.cluster import KMeans
10
  from sklearn.manifold import TSNE
11
  import numpy as np
12
+ import torch
13
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
14
+ import streamlit as st
15
+ from datetime import datetime
16
 
 
 
 
17
 
18
+ # Lighter model
19
+ MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
20
+
21
+ # Cache model loading with fallback for quantization
22
+ @st.cache_resource
23
+ def load_model():
24
+ device = "cuda" if torch.cuda.is_available() else "cpu"
25
+ print(f"Using device: {device}")
26
+ tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
27
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL).to(device)
28
+
29
+ # Attempt quantization with fallback
30
+ try:
31
+ # Pick a quantization engine the current build actually supports (fbgemm on x86, qnnpack on ARM)
33
+ torch.backends.quantized.engine = 'fbgemm' if 'fbgemm' in torch.backends.quantized.supported_engines else 'qnnpack'
33
+ model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
34
+ print("Model quantized successfully.")
35
+ except RuntimeError as e:
36
+ print(f"Quantization failed: {e}. Using non-quantized model.")
37
+
38
+ config = AutoConfig.from_pretrained(MODEL)
39
+ return tokenizer, model, config, device
40
+
41
+ tokenizer, model, config, device = load_model()
42
+
43
+ nlp_fr = spacy.load("fr_core_news_sm")
44
+ nlp_en = spacy.load("en_core_web_sm")
45
  custom_stop_words = list(ENGLISH_STOP_WORDS.union(FRENCH_STOP_WORDS))
46
 
47
+ def preprocess_text(text):
48
+ if text is None:
49
+ return ""
50
+ if not isinstance(text, str):
51
+ try:
52
+ text = str(text)
53
+ except Exception:
54
+ return ""
55
+ new_text = []
56
+ for t in text.split(" "):
57
+ t = '@user' if t.startswith('@') and len(t) > 1 else t
58
+ t = 'http' if t.startswith('http') else t
59
+ new_text.append(t)
60
+ return " ".join(new_text)
61
+
62
+ def clean_message(text):
63
+ if not isinstance(text, str):
64
+ return ""
65
+ text = text.lower()
66
+ text = text.replace("<media omitted>", "").replace("this message was deleted", "").replace("null", "")
67
+ text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
68
+ text = re.sub(r"[^a-zA-ZΓ€-ΓΏ0-9\s]", "", text)
69
+ return text.strip()
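A quick standalone sanity check of the cleaning rules above (the function is reproduced here so the snippet runs on its own):

```python
import re

def clean_message(text):
    # Same steps as the diff above: lowercase, strip WhatsApp placeholders,
    # drop URLs, then keep only letters, accented letters, digits, and spaces.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = text.replace("<media omitted>", "").replace("this message was deleted", "").replace("null", "")
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s]", "", text)
    return text.strip()

print(clean_message("Check this: https://example.com !!"))  # -> check this
print(clean_message("<Media omitted>"))                     # -> empty string
```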
70
+
71
  def lemmatize_text(text, lang):
72
  if lang == 'fr':
73
  doc = nlp_fr(text)
 
75
  doc = nlp_en(text)
76
  return " ".join([token.lemma_ for token in doc if not token.is_punct])
77
 
78
+ def preprocess(data):
79
  pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"
80
+ filtered_messages, valid_dates = [], []
81
+
82
+ for line in data.strip().split("\n"):
83
+ match = re.match(pattern, line)
84
+ if match:
85
+ entry = match.groupdict()
86
+ sender = entry.get("Sender")
87
+ if sender and sender.strip().lower() != "system":
88
+ filtered_messages.append(f"{sender.strip()}: {entry['Message']}")
89
+ valid_dates.append(f"{entry['Date']}, {entry['Time'].replace('\u202f', ' ')}")
90
91
+ def convert_to_target_format(date_str):
92
+ try:
93
+ # Attempt to parse the original date string
94
+ dt = datetime.strptime(date_str, '%d/%m/%Y, %H:%M')
95
+ except ValueError:
96
+ # Return the original date string if parsing fails
97
+ return date_str
98
+
99
+ # Extract components without leading zeros
100
+ month = dt.month
101
+ day = dt.day
102
+ year_short = dt.strftime('%y') # Last two digits of the year
103
+
104
+ # Convert to 12-hour format and determine AM/PM
105
+ hour_12 = dt.hour % 12
106
+ if hour_12 == 0:
107
+ hour_12 = 12 # Adjust 0 (from 12 AM/PM) to 12
108
+ hour_str = str(hour_12)
109
+
110
+ # Format minute with leading zero if necessary
111
+ minute_str = f"{dt.minute:02d}"
112
+
113
+ # Get AM/PM designation
114
+ am_pm = dt.strftime('%p')
115
+
116
+ # Construct the formatted date string with Unicode narrow space
117
+ return f"{month}/{day}/{year_short}, {hour_str}:{minute_str}\u202f{am_pm}"
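The 12-hour conversion above needs the explicit adjustment because `hour % 12` maps both 0 (midnight) and 12 (noon) to 0; a minimal check of that rule:

```python
def to_12h(hour):
    # hour % 12 sends both 0 and 12 to 0, so those two cases
    # must be forced back to 12, as in the function above.
    h = hour % 12
    return 12 if h == 0 else h

print([to_12h(h) for h in (0, 1, 11, 12, 13, 23)])  # [12, 1, 11, 12, 1, 11]
```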
118
+
119
+ converted_dates = [convert_to_target_format(date) for date in valid_dates]
120
+
121
+
122
+ df = pd.DataFrame({'user_message': filtered_messages, 'message_date': converted_dates})
123
+ df['message_date'] = pd.to_datetime(df['message_date'], format='%m/%d/%y, %I:%M %p', errors='coerce')
124
+ df.rename(columns={'message_date': 'date'}, inplace=True)
125
+
126
+ users, messages = [], []
127
+ msg_pattern = r"^(.*?):\s(.*)$"
128
+ for message in df["user_message"]:
129
+ match = re.match(msg_pattern, message)
130
+ if match:
131
+ users.append(match.group(1))
132
+ messages.append(match.group(2))
133
+ else:
134
+ users.append("group_notification")
135
+ messages.append(message)
136
 
137
+ df["user"] = users
138
+ df["message"] = messages
139
+ df = df[df["user"] != "group_notification"].reset_index(drop=True)
140
  df["unfiltered_messages"] = df["message"]
 
141
  df["message"] = df["message"].apply(clean_message)
142
 
143
  # Extract time-based features
144
+ df['year'] = pd.to_numeric(df['date'].dt.year, downcast='integer')
145
  df['month'] = df['date'].dt.month_name()
146
+ df['day'] = pd.to_numeric(df['date'].dt.day, downcast='integer')
147
+ df['hour'] = pd.to_numeric(df['date'].dt.hour, downcast='integer')
148
  df['day_of_week'] = df['date'].dt.day_name()
149
 
150
+ # Lemmatize messages for topic modeling
151
  lemmatized_messages = []
152
+ for message in df["message"]:
153
+ try:
154
+ lang = detect_langs(message)[0].lang
155
+ lemmatized_messages.append(lemmatize_text(message, lang))
156
+ except Exception:
157
+ lemmatized_messages.append("")
158
+ df["lemmatized_message"] = lemmatized_messages
159
+
160
+ df = df[df["message"].notnull() & (df["message"] != "")].copy()
161
+ df.drop(columns=["user_message"], inplace=True)
162
+
163
+ # Perform topic modeling
 
164
  vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words=custom_stop_words)
165
+ dtm = vectorizer.fit_transform(df['lemmatized_message'])
 
 
 
 
 
166
 
167
  # Apply LDA
168
  lda = LatentDirichletAllocation(n_components=5, random_state=42)
 
170
 
171
  # Assign topics to messages
172
  topic_results = lda.transform(dtm)
173
+ df = df.iloc[:topic_results.shape[0]].copy()
174
+ df['topic'] = topic_results.argmax(axis=1)
175
 
176
  # Store topics for visualization
177
  topics = []
178
  for topic in lda.components_:
179
  topics.append([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-10:]])
180
+ print("Top words for each topic:")
181
+ print(topics)
182
+
183
+ return df, topics
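The line pattern used in `preprocess` can be exercised on its own; both the 12-hour export style (with the U+202F narrow no-break space before AM/PM) and the 24-hour style match:

```python
import re

# Same pattern as above, split for readability.
pattern = (r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+"
           r"(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+"
           r"(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$")

m1 = re.match(pattern, "3/14/23, 9:05\u202fPM - Alice: See you tomorrow")
m2 = re.match(pattern, "14/03/2023, 21:05 - Bob: D'accord")

print(m1.group("Sender"), m1.group("Message"))  # sender and body of the 12-hour line
print(m2.group("Date"), m2.group("Time"))       # date and time of the 24-hour line
```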
184
+
185
+ def preprocess_for_clustering(df, n_clusters=5):
186
+ df = df[df["lemmatized_message"].notnull() & (df["lemmatized_message"].str.strip() != "")]
187
+ df = df.reset_index(drop=True)
188
+
189
+ vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
190
+ tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
191
+
192
+ if tfidf_matrix.shape[0] < 2:
193
+ raise ValueError("Not enough messages for clustering.")
194
+
195
+ df = df.iloc[:tfidf_matrix.shape[0]].copy()
196
+
197
+ kmeans = KMeans(n_clusters=n_clusters, random_state=42)
198
+ clusters = kmeans.fit_predict(tfidf_matrix)
199
 
200
+ df['cluster'] = clusters
201
+ tsne = TSNE(n_components=2, random_state=42)
202
+ reduced_features = tsne.fit_transform(tfidf_matrix.toarray())
 
203
 
204
+ return df, reduced_features, kmeans.cluster_centers_
205
+
206
+
207
+ def predict_sentiment_batch(texts: list, batch_size: int = 32) -> list:
208
+ """Predict sentiment for a batch of texts"""
209
+ if not isinstance(texts, list):
210
+ raise TypeError(f"Expected list of texts, got {type(texts)}")
211
+
212
+ processed_texts = [preprocess_text(text) for text in texts]
213
+
214
+ predictions = []
215
+ for i in range(0, len(processed_texts), batch_size):
216
+ batch = processed_texts[i:i+batch_size]
217
+
218
+ inputs = tokenizer(
219
+ batch,
220
+ padding=True,
221
+ truncation=True,
222
+ return_tensors="pt",
223
+ max_length=128
224
+ ).to(device)
225
+
226
+ with torch.no_grad():
227
+ outputs = model(**inputs)
228
+
229
+ batch_preds = outputs.logits.argmax(dim=1).cpu().numpy()
230
+ predictions.extend([config.id2label[p] for p in batch_preds])
231
+
232
+ return predictions
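The batching loop in `predict_sentiment_batch` is plain fixed-size slicing; a model-free sketch (the helper name is mine, not from the codebase):

```python
def batches(items, batch_size=32):
    # Yield consecutive slices of at most batch_size elements,
    # mirroring the range(0, len(...), batch_size) loop above.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

sizes = [len(b) for b in batches(list(range(70)), 32)]
print(sizes)  # [32, 32, 6]
```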
profile_performance.py DELETED
@@ -1,70 +0,0 @@
1
- import time
2
- import pandas as pd
3
- import preprocessor
4
- import random
5
-
6
- def generate_large_chat(lines=10000):
7
- """Generates a synthetic WhatsApp chat log."""
8
- senders = ["User1", "User2", "User3"]
9
- messages = [
10
- "Hello there, how are you?",
11
- "I am doing great, thanks for asking! Project update?",
12
- "This is a test message to simulate a long chat about artificial intelligence.",
13
- "Meeting is at 10 AM tomorrow to discuss the roadmap.",
14
- "Check out this link: https://example.com",
15
- "Haha that is funny πŸ˜‚",
16
- "Je parle un peu franΓ§ais aussi. C'est la vie.",
17
- "Non, je ne crois pas. Il fait beau aujourd'hui.",
18
- "Ok, see you later. Don't forget the deadline.",
19
- "Python is a great programming language for data science.",
20
- "Streamlit makes building apps very easy and fast."
21
- ]
22
-
23
- chat_data = []
24
- for _ in range(lines):
25
- date = f"{random.randint(1, 12)}/{random.randint(1, 28)}/23"
26
- hour = random.randint(1, 12)
27
- minute = random.randint(10, 59)
28
- ampm = random.choice(["AM", "PM"])
29
- time_str = f"{hour}:{minute} {ampm}"
30
- sender = random.choice(senders)
31
- message = random.choice(messages)
32
- chat_data.append(f"{date}, {time_str} - {sender}: {message}")
33
-
34
- return "\n".join(chat_data)
35
-
36
- def profile_preprocessing():
37
- print("Generating synthetic data (10,000 lines)...")
38
- raw_data = generate_large_chat(10000)
39
- print(f"Data size: {len(raw_data) / 1024 / 1024:.2f} MB")
40
-
41
- print("\nStarting profiling...")
42
- start_total = time.time()
43
-
44
- # We can't easily profile inside the function without modifying it,
45
- # so we will measure the total time and infer from code analysis
46
- # or modify preprocessor.py temporarily to print timings.
47
- # For now, let's just run it and see the total time.
48
-
49
- try:
50
- start_time = time.time()
51
-
52
- # Step 1: Parse
53
- df = preprocessor.parse_data(raw_data)
54
- print(f"Parsing took: {time.time() - start_time:.2f}s")
55
-
56
- # Step 2: Analyze
57
- step_start = time.time()
58
- df, topics = preprocessor.analyze_sentiment_and_topics(df)
59
- print(f"Analysis took: {time.time() - step_start:.2f}s")
60
- end_total = time.time()
61
- print(f"\nTotal Preprocessing Time: {end_total - start_total:.2f} seconds")
62
- print(f"Messages processed: {len(df)}")
63
-
64
- except Exception as e:
65
- print(f"Error: {e}")
66
- import traceback
67
- traceback.print_exc()
68
-
69
- if __name__ == "__main__":
70
- profile_preprocessing()
reproduce_issue.py DELETED
@@ -1,27 +0,0 @@
1
- import pandas as pd
2
- from collections import Counter
3
- import emoji
4
-
5
- def emoji_helper_simulated(emojis_list):
6
- emoji_df = pd.DataFrame(Counter(emojis_list).most_common(len(Counter(emojis_list))))
7
- return emoji_df
8
-
9
- # Case 1: Emojis present
10
- print("Case 1: Emojis present")
11
- df1 = emoji_helper_simulated(['πŸ˜€', 'πŸ˜€', 'πŸ˜‚'])
12
- print(df1)
13
- try:
14
- print(df1[1].head())
15
- print("Access successful")
16
- except KeyError as e:
17
- print(f"KeyError: {e}")
18
-
19
- # Case 2: No emojis
20
- print("\nCase 2: No emojis")
21
- df2 = emoji_helper_simulated([])
22
- print(df2)
23
- try:
24
- print(df2[1].head())
25
- print("Access successful")
26
- except KeyError as e:
27
- print(f"KeyError: {e}")
requirements.txt CHANGED
@@ -1,6 +1,6 @@
1
  streamlit
2
- matplotlib==3.7.1
3
  preprocessor
 
4
  seaborn
5
  urlextract
6
  wordcloud
@@ -18,7 +18,6 @@ plotly
18
  nltk
19
  spacy==3.7.0
20
  thinc>=8.1.8,<8.3.0
21
- python-dotenv
22
  deep_translator
23
  https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl
24
  https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl
 
1
  streamlit
 
2
  preprocessor
3
+ matplotlib
4
  seaborn
5
  urlextract
6
  wordcloud
 
18
  nltk
19
  spacy==3.7.0
20
  thinc>=8.1.8,<8.3.0
 
21
  deep_translator
22
  https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl
23
  https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl
sentiment.py CHANGED
@@ -1,27 +1,27 @@
1
  import pandas as pd
 
2
  import torch
3
- from sklearn.metrics import accuracy_score, classification_report
4
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
5
 
6
- MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
 
7
 
8
- # Check if GPU is available
9
- # Check if GPU is available (CUDA or MPS)
10
- if torch.cuda.is_available():
11
- device = torch.device("cuda")
12
- elif torch.backends.mps.is_available():
13
- device = torch.device("mps")
14
- else:
15
- device = torch.device("cpu")
16
  print(f"Using device: {device}")
17
 
18
- # Load the model and tokenizer
19
- model = AutoModelForSequenceClassification.from_pretrained(MODEL)
20
- model.to(device) # Move model to GPU
21
- tokenizer = AutoTokenizer.from_pretrained(MODEL)
22
  config = AutoConfig.from_pretrained(MODEL)
23
 
24
- # Preprocess text (username and link placeholders)
 
 
 
25
  def preprocess(text):
26
  if not isinstance(text, str):
27
  text = str(text) if not pd.isna(text) else ""
@@ -31,58 +31,68 @@ def preprocess(text):
31
  t = '@user' if t.startswith('@') and len(t) > 1 else t
32
  t = 'http' if t.startswith('http') else t
33
  new_text.append(t)
34
-
35
  return " ".join(new_text)
36
 
37
- # Sentiment prediction using GPU (Batch Processing)
38
- def predict_sentiment_bert_batch(texts: list, batch_size: int = 32) -> list:
39
- all_sentiments = []
 
40
 
41
- # Preprocess all texts
42
- processed_texts = [preprocess(text) for text in texts]
 
 
43
 
44
- # Process in batches
 
 
 
45
  for i in range(0, len(processed_texts), batch_size):
46
- batch_texts = processed_texts[i:i + batch_size]
47
-
48
- encoded_input = tokenizer(
49
- batch_texts,
50
- return_tensors='pt',
51
- truncation=True,
52
- padding=True,
53
- max_length=128
54
- )
55
-
56
- encoded_input.pop("token_type_ids", None) # XLM-Roberta doesn't use these
57
-
58
- # Move input tensors to the same device as the model
59
- encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
60
-
61
- model.eval()
62
- with torch.no_grad():
63
- output = model(**encoded_input)
64
 
65
- # Get predictions for the batch
66
- batch_indices = output.logits.argmax(dim=1).tolist()
67
- batch_sentiments = [config.id2label[idx] for idx in batch_indices]
68
- all_sentiments.extend(batch_sentiments)
 
69
 
70
- return all_sentiments
 
 
 
 
 
71
 
72
- # Keep single prediction for backward compatibility if needed, but it calls batch with size 1
73
- def predict_sentiment_bert(text: str) -> str:
74
- return predict_sentiment_bert_batch([text], batch_size=1)[0]
 
75
 
76
- # Example usage (optional)
77
- # print(predict_sentiment_bert("This is amazing!"))
78
 
79
- # Predict on full dataset (uncomment when ready)
80
- # test_data['predicted_sentiment'] = test_data['text'].apply(predict_sentiment_bert)
81
 
82
- # Calculate accuracy (uncomment when ready)
83
- # accuracy = accuracy_score(test_labels['label'], test_data['predicted_sentiment'])
84
- # print(f"Accuracy: {accuracy:.2f}")
 
 
 
 
85
 
86
- # Generate classification report (uncomment when ready)
87
- # report = classification_report(test_labels['label'], test_data['predicted_sentiment'])
88
- # print("Classification Report:\n", report)
 
1
  import pandas as pd
2
+ import time
3
  import torch
 
4
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
5
 
6
+ # Use a sentiment-specific model (replace with TinyBERT if fine-tuned)
7
+ MODEL = "tabularisai/multilingual-sentiment-analysis" # Multilingual model; labels include "Very Positive"/"Very Negative", collapsed below
8
 
9
+ print("Loading model and tokenizer...")
10
+ start_load = time.time()
11
+
12
+ # Check for MPS (Metal) availability on M2 chip, fallback to CPU
13
+ device = "mps" if torch.backends.mps.is_available() else "cpu"
 
 
 
14
  print(f"Using device: {device}")
15
 
16
+ # Load with optimizations (only once, removing redundancy)
17
+ tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
18
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL).to(device)
 
19
  config = AutoConfig.from_pretrained(MODEL)
20
 
21
+ load_time = time.time() - start_load
22
+ print(f"Model and tokenizer loaded in {load_time:.2f} seconds\n")
23
+
24
+ # Optimized preprocessing (unchanged from your code)
25
  def preprocess(text):
26
  if not isinstance(text, str):
27
  text = str(text) if not pd.isna(text) else ""
 
31
  t = '@user' if t.startswith('@') and len(t) > 1 else t
32
  t = 'http' if t.startswith('http') else t
33
  new_text.append(t)
 
34
  return " ".join(new_text)
35
 
36
+ # Batch prediction function (optimized for performance)
37
+ def predict_sentiment_batch(texts: list, batch_size: int = 16) -> list:
38
+ if not isinstance(texts, list):
39
+ raise TypeError(f"Expected list of texts, got {type(texts)}")
40
 
41
+ # Validate and clean inputs
42
+ valid_texts = [str(text) for text in texts if isinstance(text, str) and text.strip()]
43
+ if not valid_texts:
44
+ return [] # Return empty list if no valid texts
45
 
46
+ print(f"Processing {len(valid_texts)} valid samples...")
47
+ processed_texts = [preprocess(text) for text in valid_texts]
48
+
49
+ predictions = []
50
  for i in range(0, len(processed_texts), batch_size):
51
+ batch = processed_texts[i:i + batch_size]
52
+ try:
53
+ inputs = tokenizer(
54
+ batch,
55
+ padding=True,
56
+ truncation=True,
57
+ return_tensors="pt",
58
+ max_length=64 # Reduced for speed on short texts like tweets
59
+ ).to(device)
60
+
61
+ with torch.no_grad():
62
+ outputs = model(**inputs)
 
 
 
 
 
 
63
 
64
+ batch_preds = outputs.logits.argmax(dim=1).cpu().numpy()
65
+ predictions.extend([config.id2label[p] for p in batch_preds])
66
+ except Exception as e:
67
+ print(f"Error processing batch {i // batch_size}: {str(e)}")
68
+ predictions.extend(["neutral"] * len(batch)) # Consider logging instead
69
 
70
+ print(f"Generated predictions for {len(valid_texts)} samples")
71
+ predictions = [prediction.lower().replace("very ", "") for prediction in predictions]
72
+
73
74
+
75
+ return predictions
76
 
77
+ # # Example usage with your dataset (uncomment and adjust paths)
78
+ # test_data = pd.read_csv("/Users/caasidev/development/AI/last try/Whatssap-project/srcs/tweets.csv")
79
+ # print(f"Processing {len(test_data)} samples...")
80
+ # start_prediction = time.time()
81
 
82
+ # text_samples = test_data['text'].tolist()
83
+ # test_data['predicted_sentiment'] = predict_sentiment_batch(text_samples)
84
 
85
+ # prediction_time = time.time() - start_prediction
86
+ # time_per_sample = prediction_time / len(test_data)
87
 
88
+ # # Print runtime statistics
89
+ # print("\nRuntime Statistics:")
90
+ # print(f"- Model loading time: {load_time:.2f} seconds")
91
+ # print(f"- Total prediction time for {len(test_data)} samples: {prediction_time:.2f} seconds")
92
+ # print(f"- Average time per sample: {time_per_sample:.4f} seconds")
93
+ # print(f"- Estimated time for 1000 samples: {(time_per_sample * 1000):.2f} seconds")
94
+ # print(f"- Estimated time for 20000 samples: {(time_per_sample * 20000 / 60):.2f} minutes")
95
 
96
+ # # Print a sample of predictions
97
+ # print("\nPredicted Sentiments (first 5 samples):")
98
+ # print(test_data[['text', 'predicted_sentiment']].head())
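The `.lower().replace("very ", "")` step above folds the model's five-level labels (assuming a Very Negative through Very Positive scheme) into three classes:

```python
def normalize_label(label):
    # "Very Positive" -> "positive", "Very Negative" -> "negative",
    # everything else is simply lowercased.
    return label.lower().replace("very ", "")

raw = ["Very Positive", "Positive", "Neutral", "Negative", "Very Negative"]
print([normalize_label(l) for l in raw])
# ['positive', 'positive', 'neutral', 'negative', 'negative']
```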
sentiment_train.py DELETED
@@ -1,41 +0,0 @@
1
- import os
2
- import joblib
3
- import string
4
- import re
5
- import nltk
6
- from nltk.corpus import stopwords
7
-
8
- # Get the directory of the current script
9
- script_dir = os.path.dirname(__file__)
10
-
11
- # Construct paths to the model and vectorizer files
12
- model_path = os.path.join(script_dir, "naive_bayes_model.pkl")
13
- vectorizer_path = os.path.join(script_dir, "tfidf_vectorizer.pkl")
14
-
15
- # Load saved model and vectorizer
16
- try:
17
- model = joblib.load(model_path)
18
- vectorizer = joblib.load(vectorizer_path)
19
- except FileNotFoundError as e:
20
- print(f"Error: {e}")
21
- raise
22
-
23
- # Load stopwords
24
- nltk.download("stopwords")
25
- stop_words = set(stopwords.words("english") + stopwords.words("french"))
26
-
27
- # Function to clean text (must match preprocessing in training script)
28
- def clean_text(text):
29
- if isinstance(text, float):
30
- return ""
31
- text = text.lower()
32
- text = re.sub(f"[{string.punctuation}]", "", text)
33
- text = " ".join([word for word in text.split() if word not in stop_words])
34
- return text
35
-
36
- # Function to predict sentiment
37
- def predict_sentiment(text):
38
- cleaned_text = clean_text(text)
39
- vectorized_text = vectorizer.transform([cleaned_text])
40
- prediction = model.predict(vectorized_text)
41
- return prediction[0]
test.py DELETED
@@ -1,67 +0,0 @@
1
- import nltk
2
- import string
3
- import re
4
- import pandas as pd
5
- import numpy as np
6
- import joblib
7
- from nltk.corpus import stopwords
8
- from sklearn.feature_extraction.text import TfidfVectorizer
9
- from sklearn.model_selection import train_test_split
10
- from sklearn.naive_bayes import MultinomialNB
11
- from sklearn.metrics import accuracy_score, classification_report
12
- from googletrans import Translator
13
- from imblearn.over_sampling import SMOTE
14
-
15
- nltk.download('stopwords')
16
- nltk.download('punkt')
17
-
18
- translator = Translator()
19
-
20
- # Load dataset
21
- data = pd.read_csv('/Users/caasidev/development/AI/datasets/train.csv', encoding='ISO-8859-1')
22
-
23
- # Drop missing values
24
- data = data.dropna(subset=['text', 'sentiment'])
25
-
26
- stop_words = set(stopwords.words('english') + stopwords.words('french'))
27
-
28
- # Function to clean text
29
- def clean_text(text):
30
- if isinstance(text, float):
31
- return ""
32
- text = text.lower()
33
- text = re.sub(f"[{string.punctuation}]", "", text)
34
- text = " ".join([word for word in text.split() if word not in stop_words])
35
- return text
36
-
37
- # Apply text cleaning
38
- data['Cleaned_Text'] = data['text'].apply(clean_text)
39
-
40
- # **Vectorization BEFORE SMOTE**
41
- vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.85, min_df=2, max_features=10000)
42
- X_tfidf = vectorizer.fit_transform(data['Cleaned_Text'])
43
- y = data['sentiment']
44
-
45
- # Apply SMOTE **after** vectorization
46
- smote = SMOTE(random_state=42)
47
- X_resampled, y_resampled = smote.fit_resample(X_tfidf, y)
48
-
49
- # Train-test split
50
- X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
51
-
52
- # Train Naive Bayes
53
- model = MultinomialNB(alpha=0.5)
54
- model.fit(X_train, y_train)
55
-
56
- # Save model and vectorizer
57
- joblib.dump(model, "naive_bayes_model.pkl")
58
- joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
59
- print("Model and vectorizer saved successfully!")
60
-
61
- # Predictions
62
- y_pred = model.predict(X_test)
63
-
64
- # Evaluation
65
- accuracy = accuracy_score(y_test, y_pred)
66
- print(f"Improved Accuracy: {accuracy * 100:.2f}%")
67
- print("\nClassification Report:\n", classification_report(y_test, y_pred))
tfidf_vectorizer.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:a6f08647a3d11077c7c5922973244b227a8d2b2e7ac46707d9f258c8a92de1b5
3
- size 375493
 
 
 
 
verify_fix.py DELETED
@@ -1,48 +0,0 @@
- import pandas as pd
- import helper
-
- # Create a dummy dataframe with no emojis
- data = {
-     'user': ['User1', 'User2'],
-     'message': ['Hello world', 'This is a test'],
-     'unfiltered_messages': ['Hello world', 'This is a test']
- }
- df = pd.DataFrame(data)
-
- print("Testing emoji_helper with no emojis...")
- try:
-     emoji_df = helper.emoji_helper('Overall', df)
-     print("emoji_df columns:", emoji_df.columns.tolist())
-     print("emoji_df shape:", emoji_df.shape)
-
-     if 0 in emoji_df.columns and 1 in emoji_df.columns:
-         print("SUCCESS: Columns 0 and 1 exist.")
-     else:
-         print("FAILURE: Columns 0 and 1 missing.")
-
- except Exception as e:
-     print(f"FAILURE: Exception occurred: {e}")
-
- # Test with emojis to ensure no regression
- data_with_emoji = {
-     'user': ['User1'],
-     'message': ['Hello 😀'],
-     'unfiltered_messages': ['Hello 😀']
- }
- df_emoji = pd.DataFrame(data_with_emoji)
-
- print("\nTesting emoji_helper with emojis...")
- try:
-     emoji_df = helper.emoji_helper('Overall', df_emoji)
-     print("emoji_df columns:", emoji_df.columns.tolist())
-     print("emoji_df shape:", emoji_df.shape)
-
-     if 0 in emoji_df.columns and 1 in emoji_df.columns:
-         print("SUCCESS: Columns 0 and 1 exist.")
-     else:
-         print("FAILURE: Columns 0 and 1 missing.")
-
- except Exception as e:
-     print(f"FAILURE: Exception occurred: {e}")
verify_refactor.py DELETED
@@ -1,41 +0,0 @@
- import helper
- from openrouter_chat import get_chat_completion
-
- def test_openrouter_connection():
-     print("Testing OpenRouter connection...")
-     try:
-         response = get_chat_completion([{"role": "user", "content": "Hello"}])
-         if response:
-             print("✅ OpenRouter connection successful.")
-         else:
-             print("❌ OpenRouter connection failed (empty response).")
-     except Exception as e:
-         print(f"❌ OpenRouter connection failed: {e}")
-
- def test_title_generation():
-     print("\nTesting Title Generation from Messages...")
-     messages = [
-         "Hey, are we still on for the movies tonight?",
-         "Yeah, lets go watch that new sci-fi one.",
-         "Cool, I'll buy the tickets online.",
-         "Meet you at the cinema at 7?"
-     ]
-
-     topic_map = {0: messages}
-
-     try:
-         titles = helper.generate_topic_titles_from_messages(topic_map)
-         title = titles.get(0)
-         print(f"Generated Title: {title}")
-
-         if title and title != "Topic 0":
-             print("✅ Title generation successful.")
-         else:
-             print("⚠️ Title generation returned default or empty.")
-
-     except Exception as e:
-         print(f"❌ Title generation failed: {e}")
-
- if __name__ == "__main__":
-     test_openrouter_connection()
-     test_title_generation()