Files changed (21)
  1. .DS_Store +0 -0
  2. .env +0 -1
  3. .gitignore +0 -10
  4. .gitignore.save +0 -3
  5. README.md +147 -19
  6. app.py +399 -255
  7. debug_regex.py +0 -23
  8. finetune.py +0 -102
  9. helper.py +91 -195
  10. naive_bayes_model.pkl +0 -3
  11. openrouter_chat.py +0 -91
  12. preprocessor.py +187 -165
  13. profile_performance.py +0 -70
  14. reproduce_issue.py +0 -27
  15. requirements.txt +1 -2
  16. sentiment.py +68 -58
  17. sentiment_train.py +0 -41
  18. test.py +0 -67
  19. tfidf_vectorizer.pkl +0 -3
  20. verify_fix.py +0 -48
  21. verify_refactor.py +0 -41
.DS_Store DELETED
Binary file (6.15 kB)
 
.env DELETED
@@ -1 +0,0 @@
- OPENROUTER_API_KEY="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"
 
 
.gitignore DELETED
@@ -1,10 +0,0 @@
- venv/
- .env/
- __pycache__/
- *.pyc
- .streamlit/secrets.toml
- .streamlit/secrets.toml
- .streamlit/secrets.toml
- .venv/
- venv/
- .venv/
 
.gitignore.save DELETED
@@ -1,3 +0,0 @@
- .venv/
-
-
 
README.md CHANGED
@@ -1,25 +1,153 @@
- ---
- title: WhatsApp Chat
- emoji: 😻
- colorFrom: pink
- colorTo: gray
- sdk: streamlit
- sdk_version: 1.44.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: πŸ“Š WhatsApp Chat Sentiment Analysis
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
- ## Running Locally
-
- 1. Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```

- 2. Run the application:
  ```bash
- streamlit run app.py
  ```
 
+ # WhatsApp Chat Analyzer
+
+ A comprehensive tool for analyzing WhatsApp chat exports with sentiment analysis capabilities.
+
+ ## Table of Contents
+ 1. [System Overview](#system-overview)
+ 2. [Architecture](#architecture)
+ 3. [Components](#components)
+ 4. [Data Flow](#data-flow)
+ 5. [Installation](#installation)
+ 6. [Usage](#usage)
+ 7. [Analysis Capabilities](#analysis-capabilities)
+
+ ## System Overview
+
+ The WhatsApp Chat Analyzer is a Python-based application that processes exported WhatsApp chat data to provide:
+ - Message statistics and metrics
+ - Temporal activity patterns
+ - User engagement analysis
+ - Content analysis (words, emojis, links)
+ - Sentiment analysis capabilities
+ - Topic analysis for group chats
+
+ Built with Streamlit for the web interface, it offers an interactive way to explore chat dynamics and analyze sentiment.
+
+ ## Architecture
+
+ The system follows a modular architecture with clear separation of concerns:
+
+ ```
+ Raw WhatsApp Chat β†’ Preprocessing β†’ Analysis β†’ Visualization
+ ```
+
+ Key architectural decisions:
+ - **Modular Design**: Components are separated by functionality
+ - **Pipeline Processing**: Data flows through discrete processing stages
+ - **Interactive UI**: Streamlit enables real-time exploration
+
+ ## Components
+
+ ### 1. App Module (`app.py`)
+ - **Responsibility**: User interface and visualization
+ - **Key Features**:
+   - File upload handling
+   - User selection interface
+   - Visualization rendering
+   - Interactive controls
+
+ ### 2. Preprocessor (`preprocessor.py`)
+ - **Responsibility**: Data cleaning and structuring
+ - **Key Features**:
+   - Handles multiple date/time formats
+   - Extracts messages and metadata
+   - Filters system messages
+   - Creates structured DataFrame
+
+ ### 3. Helper Module (`helper.py`)
+ - **Responsibility**: Analytical computations
+ - **Key Features**:
+   - Statistical metrics
+   - Temporal analysis
+   - Content analysis
+   - Visualization data preparation
+
+ ### 4. Notebook (`whatsAppAnalyzer.ipynb`)
+ - **Responsibility**: Prototyping and experimentation
+ - **Key Features**:
+   - Initial pattern development
+   - Data exploration
+   - Algorithm testing
+
+ ## Data Flow
+
+ 1. **Input**: User uploads WhatsApp chat export (.txt)
+ 2. **Preprocessing**:
+    - Raw text is parsed using regex patterns
+    - Messages are categorized and timestamped
+    - Structured DataFrame is created
+ 3. **Analysis**:
+    - Selected metrics are computed
+    - Temporal patterns are identified
+    - Content features are extracted
+ 4. **Visualization**:
+    - Results are displayed in interactive charts
+    - User can explore different views
+
+ ## Installation
+
+ ### Prerequisites
+ - Python 3.8+
+ - pip package manager
+
+ ### Steps
+ 1. Clone the repository:
+ ```bash
+ git clone [repository-url]
+ cd whatsapp-analyzer
+ ```
+
+ 2. Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```

+ 3. Run the application:
  ```bash
+ streamlit run srcs/app.py
  ```
+
+ ## Usage
+
+ 1. Launch the application
+ 2. Upload a WhatsApp chat export file
+ 3. Select a user or "Overall" for group analysis
+ 4. Explore the various analysis tabs:
+    - Statistics
+    - Timelines
+    - Activity Maps
+    - Word Clouds
+    - Emoji Analysis
+
+ ## Analysis Capabilities
+
+ ### 1. Basic Statistics
+ - Message counts
+ - Word counts
+ - Media shared
+ - Links shared
+
+ ### 2. Temporal Analysis
+ - Daily activity patterns
+ - Monthly trends
+ - Hourly distributions
+
+ ### 3. User Engagement
+ - Most active users
+ - User participation rates
+ - Message distribution
+
+ ### 4. Content Analysis
+ - Most common words
+ - Emoji usage
+
+ ### 5. Sentiment Analysis
+ - Message sentiment scoring
+ - Sentiment trends over time
+ - User sentiment comparison
+
+ ### 6. Topics Analysis
+ - Topic modeling
+ - Common topics over time
+ - User interests
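
The Data Flow section above says the raw export is parsed with regex patterns into a structured table. As a rough illustration of that step, here is a minimal sketch with a hypothetical `parse_chat` helper, assuming only one common export format ("DD/MM/YY, HH:MM - Sender: Message"); the project's `preprocessor.py` handles several date/time variants and builds a pandas DataFrame instead of a plain list.

```python
import re

# One common WhatsApp export line format; multiline messages and
# system notifications would need extra handling.
PATTERN = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$")

def parse_chat(raw_text):
    """Parse matching lines into one dict per message."""
    rows = []
    for line in raw_text.splitlines():
        m = PATTERN.match(line)
        if m:
            date, time, user, message = m.groups()
            rows.append({"date": date, "time": time, "user": user, "message": message})
    return rows

rows = parse_chat("12/03/23, 14:05 - Alice: hello\n12/03/23, 14:06 - Bob: hi there")
# rows[0]["user"] == "Alice"
```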
app.py CHANGED
@@ -1,16 +1,15 @@
  import streamlit as st
  import pandas as pd
  import matplotlib.pyplot as plt
- import os
-
- # Silence tokenizers warning
- os.environ["TOKENIZERS_PARALLELISM"] = "false"
  import seaborn as sns
  import preprocessor, helper
- import calendar

  # Theme customization
- st.set_page_config(page_title="WhatsApp Chat Analyzer", layout="wide")
  st.markdown(
      """
      <style>
@@ -20,275 +19,420 @@ st.markdown(
      unsafe_allow_html=True
  )

  st.title("πŸ“Š WhatsApp Chat Sentiment Analysis Dashboard")
  st.subheader('Instructions')
- st.markdown("1. Open the side bar and upload your WhatsApp chat file in .txt format.")
- st.markdown("2. Wait for it to load")
- st.markdown("3. Once the data is loaded, you can customize the analysis by selecting specific users or filtering the data.")
- st.markdown("4. Click on the 'Show Analysis' button to update the analysis with your selected filters.")

  st.sidebar.title("Whatsapp Chat Analyzer")

- # OpenRouter API Key is now handled via .env and openrouter_chat.py
- # No need to request HF token from user
-

- uploaded_file = st.sidebar.file_uploader("Upload your chat file (.txt)", type="txt")
  if uploaded_file is not None:
      raw_data = uploaded_file.read().decode("utf-8")
-
-     # Step 1: Fast Parsing (Lazy Loading)
-     @st.cache_data
-     def load_parsed_data(data):
-         return preprocessor.parse_data(data)
-
-     df = load_parsed_data(raw_data)

-     # Sidebar filters
      st.sidebar.header("πŸ” Filters")
-     user_list = df['user'].unique().tolist()
-     if 'group_notification' in user_list:
-         user_list.remove('group_notification')
-     user_list.sort()
-     user_list.insert(0, "Overall")

-     selected_user = st.sidebar.selectbox("Show analysis wrt", user_list)

      if st.sidebar.button("Show Analysis"):
-         # Basic Stats (Instant)
-         num_messages, words, num_media_messages, num_links = helper.fetch_stats(selected_user, df)
-         st.title("Top Statistics")
-         col1, col2, col3, col4 = st.columns(4)
-
-         with col1:
-             st.header("Total Messages")
-             st.title(num_messages)
-         with col2:
-             st.header("Total Words")
-             st.title(words)
-         with col3:
-             st.header("Media Shared")
-             st.title(num_media_messages)
-         with col4:
-             st.header("Links Shared")
-             st.title(num_links)
-
-         # Monthly Timeline
-         st.title("Monthly Timeline")
-         timeline = helper.monthly_timeline(selected_user, df)
-         fig, ax = plt.subplots()
-         ax.plot(timeline['time'], timeline['message'], color='green')
-         plt.xticks(rotation='vertical')
-         st.pyplot(fig)
-
-         # Daily Timeline
-         st.title("Daily Timeline")
-         daily_timeline = helper.daily_timeline(selected_user, df)
-         fig, ax = plt.subplots()
-         ax.plot(daily_timeline['date'], daily_timeline['message'], color='black')
-         plt.xticks(rotation='vertical')
-         st.pyplot(fig)
-
-         # Activity Map
-         st.title('Activity Map')
-         col1, col2 = st.columns(2)
-
-         with col1:
-             st.header("Most busy day")
-             busy_day = helper.week_activity_map(selected_user, df)
-             fig, ax = plt.subplots()
-             ax.bar(busy_day.index, busy_day.values, color='purple')
-             plt.xticks(rotation='vertical')
-             st.pyplot(fig)
-
-         with col2:
-             st.header("Most busy month")
-             busy_month = helper.month_activity_map(selected_user, df)
-             fig, ax = plt.subplots()
-             ax.bar(busy_month.index, busy_month.values, color='orange')
-             plt.xticks(rotation='vertical')
-             st.pyplot(fig)
-
-         # st.title("Weekly Activity Map")
-         # user_heatmap = helper.activity_heatmap(selected_user, df)
-         # fig, ax = plt.subplots()
-         # ax = sns.heatmap(user_heatmap)
-         # st.pyplot(fig)
-
-         # Most Busy Users
-         if selected_user == 'Overall':
-             st.title('Most Busy Users')
-             x, new_df = helper.most_busy_users(df)
-             fig, ax = plt.subplots()
-
-             col1, col2 = st.columns(2)
-
-             with col1:
-                 ax.bar(x.index, x.values, color='red')
-                 plt.xticks(rotation='vertical')
-                 st.pyplot(fig)
-             with col2:
-                 st.dataframe(new_df)
-
-         # WordCloud
-         st.title("Wordcloud")
-         df_wc = helper.create_wordcloud(selected_user, df)
-         fig, ax = plt.subplots()
-         ax.imshow(df_wc)
-         st.pyplot(fig)
-
-         # Most Common Words
-         st.title('Most Common Words')
-         most_common_df = helper.most_common_words(selected_user, df)
-
-         # Filter emojis to prevent matplotlib warnings
-         most_common_df[0] = most_common_df[0].apply(helper.remove_emojis)
-
-         fig, ax = plt.subplots()
-         ax.barh(most_common_df[0], most_common_df[1])
-         plt.xticks(rotation='vertical')
-         st.pyplot(fig)
-
-         # Emoji Analysis
-         st.title("Emoji Analysis")
-         emoji_df = helper.emoji_helper(selected_user, df)
-         col1, col2 = st.columns(2)
-
-         with col1:
-             st.dataframe(emoji_df)
-         with col2:
-             if not emoji_df.empty:
-                 fig, ax = plt.subplots()
-                 ax.pie(emoji_df[1].head(), labels=emoji_df[0].head(), autopct="%0.2f")
-                 st.pyplot(fig)
-             else:
-                 st.write("No emojis found")
-
-         # --- Deep Analysis Section (Lazy Loaded) ---
-         st.markdown("---")
-         st.header("πŸ€– Deep AI Analysis")
-         st.info("Analyzing Sentiment and Topics... (This may take a few seconds for large files)")
-
-         with st.spinner("Running AI models..."):
-             # Filter df based on selected user first
-             if selected_user != 'Overall':
-                 df_to_analyze = df[df['user'] == selected_user]
-             else:
-                 df_to_analyze = df
-
-             # Check if enough data
-             if len(df_to_analyze) < 10:
-                 st.warning("Not enough data for deep analysis.")
              else:
-                 # Run Analysis
-                 @st.cache_data
-                 def run_deep_analysis(data_frame):
-                     return preprocessor.analyze_sentiment_and_topics(data_frame)
-
-                 analyzed_df, topics = run_deep_analysis(df_to_analyze)
-
-                 # Sentiment Analysis Visualization
-                 st.title("Sentiment Analysis")
-                 sentiment_counts = analyzed_df['sentiment'].value_counts()
-
-                 col1, col2 = st.columns(2)
-                 with col1:
-                     st.dataframe(sentiment_counts)
-                 with col2:
-                     fig, ax = plt.subplots()
-                     ax.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=90, colors=['#66b3ff','#99ff99','#ffcc99'])
-                     ax.axis('equal')
-                     st.pyplot(fig)

-                 # Sentiment by Month
-                 st.write("### Sentiment Count by Month")
-                 # Convert month names to abbreviated format
-                 month_map = {
-                     'January': 'Jan', 'February': 'Feb', 'March': 'Mar', 'April': 'Apr',
-                     'May': 'May', 'June': 'Jun', 'July': 'Jul', 'August': 'Aug',
-                     'September': 'Sep', 'October': 'Oct', 'November': 'Nov', 'December': 'Dec'
-                 }
-                 analyzed_df['month'] = analyzed_df['month'].map(month_map)
-                 monthly_sentiment = analyzed_df.groupby(['month', 'sentiment']).size().unstack(fill_value=0)
-
-                 fig, axes = plt.subplots(1, 3, figsize=(18, 5))
-                 # Plot Positive Sentiment
-                 if 'positive' in monthly_sentiment.columns:
-                     axes[0].bar(monthly_sentiment.index, monthly_sentiment['positive'], color='green')
-                     axes[0].set_title('Positive Sentiment')
-
-                 # Plot Neutral Sentiment
-                 if 'neutral' in monthly_sentiment.columns:
-                     axes[1].bar(monthly_sentiment.index, monthly_sentiment['neutral'], color='blue')
-                     axes[1].set_title('Neutral Sentiment')
-
-                 # Plot Negative Sentiment
-                 if 'negative' in monthly_sentiment.columns:
-                     axes[2].bar(monthly_sentiment.index, monthly_sentiment['negative'], color='red')
-                     axes[2].set_title('Negative Sentiment')
-
-                 st.pyplot(fig)
-
-                 # Topic Analysis Visualization
-                 if len(topics) > 0:
-                     st.title("Topic Analysis")
-                     fig = helper.plot_topics(topics)
-                     st.pyplot(fig)

-                     # Prepare data for new title generation
-                     topic_messages_map = {}
-                     for topic_id in analyzed_df['topic'].unique():
-                         # Get all messages for this topic
-                         msgs = analyzed_df[analyzed_df['topic'] == topic_id]['message'].tolist()
-                         # Select a sample (e.g., top 10 or random 10) to send to AI
-                         # Here we take the first 10 for simplicity, or random could be better
-                         topic_messages_map[topic_id] = msgs[:10]
-
-                     # Generate titles from messages
-                     with st.spinner("Generating descriptive topic titles..."):
-                         topic_titles_map = helper.generate_topic_titles_from_messages(topic_messages_map)

-                     # Convert map to list for plotting compatibility (or update plotting to take dict)
-                     # plot_topics expects a list corresponding to the topics list order
-                     # The topics list from LDA is ordered by topic index 0, 1, 2...
-                     custom_titles = [topic_titles_map.get(i, f"Topic {i}") for i in range(len(topics))]

-                     st.title("Topic Analysis")
-                     fig = helper.plot_topics(topics, custom_titles=custom_titles)
                      st.pyplot(fig)

                      # Display Sample Messages for Each Topic
                      st.header("Sample Messages for Each Topic")

-                     for idx, topic_id in enumerate(analyzed_df['topic'].unique()):
-                         title = topic_titles_map.get(topic_id, f"Topic {topic_id}")
-                         st.subheader(title)
-                         filtered_messages = analyzed_df[analyzed_df['topic'] == topic_id]['message']
-                         for msg in filtered_messages.head(5):
-                             st.text(f"- {msg}")
-                 else:
-                     st.warning("No topics found for visualization.")
-
-                 # Clustering Analysis
-                 st.title("Clustering Analysis")
-                 n_clusters = st.slider("Select Number of Clusters", min_value=2, max_value=10, value=5)
-
-                 # Perform clustering on analyzed_df (which has lemmatized_message)
-                 clustered_df, reduced_features, _ = preprocessor.preprocess_for_clustering(analyzed_df, n_clusters=n_clusters)
-
-                 st.header("Cluster Visualization")
-                 fig = helper.plot_clusters(reduced_features, clustered_df['cluster'])
-                 st.pyplot(fig)
-
-                 # Cluster Insights
-                 st.header("Insights from Clusters")
-
-                 st.subheader("1. Dominant Conversation Themes")
-                 cluster_labels = helper.get_cluster_labels(clustered_df, n_clusters)
-                 for cluster_id, label in cluster_labels.items():
-                     st.write(f"**Cluster {cluster_id}**: {label}")
-
-                 st.subheader("2. Actionable Recommendations")
-                 recommendations = helper.generate_recommendations(clustered_df)
-                 for recommendation in recommendations:
-                     st.write(f"- {recommendation}")
1
  import streamlit as st
2
+ st.set_page_config(page_title="WhatsApp Chat Analyzer", layout="wide")
3
+
4
  import pandas as pd
5
  import matplotlib.pyplot as plt
 
 
 
 
6
  import seaborn as sns
7
  import preprocessor, helper
8
+ from sentiment import predict_sentiment_batch
9
+ import os
10
+ os.environ["STREAMLIT_SERVER_RUN_ON_SAVE"] = "false"
11
 
12
  # Theme customization
 
13
  st.markdown(
14
  """
15
  <style>
 
19
  unsafe_allow_html=True
20
  )
21
 
22
+ # Set seaborn style
23
+ sns.set_theme(style="whitegrid")
24
+
25
  st.title("πŸ“Š WhatsApp Chat Sentiment Analysis Dashboard")
26
  st.subheader('Instructions')
27
+ st.markdown("1. Open the sidebar and upload your WhatsApp chat file in .txt format.")
28
+ st.markdown("2. Wait for the initial processing (minimal delay).")
29
+ st.markdown("3. Customize the analysis by selecting users or filters.")
30
+ st.markdown("4. Click 'Show Analysis' for detailed results.")
31
 
32
  st.sidebar.title("Whatsapp Chat Analyzer")
33
+ uploaded_file = st.sidebar.file_uploader("Upload your chat file (.txt)", type="txt")
34
 
35
+ @st.cache_data
36
+ def load_and_preprocess(file_content):
37
+ return preprocessor.preprocess(file_content)
38
 
 
39
  if uploaded_file is not None:
40
  raw_data = uploaded_file.read().decode("utf-8")
41
+ with st.spinner("Loading chat data..."):
42
+ df, _ = load_and_preprocess(raw_data)
43
+ st.session_state.df = df
 
 
 
 
44
 
 
45
  st.sidebar.header("πŸ” Filters")
46
+ user_list = ["Overall"] + sorted(df["user"].unique().tolist())
47
+ selected_user = st.sidebar.selectbox("Select User", user_list)
 
 
 
48
 
49
+ df_filtered = df if selected_user == "Overall" else df[df["user"] == selected_user]
50
 
51
  if st.sidebar.button("Show Analysis"):
52
+ if df_filtered.empty:
53
+ st.warning(f"No data found for user: {selected_user}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
54
  else:
55
+ with st.spinner("Analyzing..."):
56
+ if 'sentiment' not in df_filtered.columns:
57
+ try:
58
+ print("Starting sentiment analysis...")
59
+ # Get messages as clean strings
60
+ message_list = df_filtered["message"].astype(str).tolist()
61
+ message_list = [msg for msg in message_list if msg.strip()]
62
+
63
+ print(f"Processing {len(message_list)} messages")
64
+ print(f"Sample messages: {message_list[:5]}")
65
+
66
+ # Directly call the sentiment analysis function
67
+ df_filtered['sentiment'] = predict_sentiment_batch(message_list)
68
+ print("Sentiment analysis completed successfully")
69
+
70
+ except Exception as e:
71
+ st.error(f"Sentiment analysis failed: {str(e)}")
72
+ print(f"Full error: {str(e)}")
73
+
74
+ st.session_state.df_filtered = df_filtered
75
+ else:
76
+ st.session_state.df_filtered = df_filtered
77
 
78
+ # Display statistics and visualizations
79
+ num_messages, words, num_media, num_links = helper.fetch_stats(selected_user, df_filtered)
80
+ st.title("Top Statistics")
81
+ col1, col2, col3, col4 = st.columns(4)
82
+ with col1:
83
+ st.header("Total Messages")
84
+ st.title(num_messages)
85
+ with col2:
86
+ st.header("Total Words")
87
+ st.title(words)
88
+ with col3:
89
+ st.header("Media Shared")
90
+ st.title(num_media)
91
+ with col4:
92
+ st.header("Links Shared")
93
+ st.title(num_links)
94
+
95
+ st.title("Monthly Timeline")
96
+ timeline = helper.monthly_timeline(selected_user, df_filtered.sample(min(5000, len(df_filtered))))
97
+ if not timeline.empty:
98
+ plt.figure(figsize=(10, 5))
99
+ sns.lineplot(data=timeline, x='time', y='message', color='green')
100
+ plt.title("Monthly Timeline")
101
+ plt.xlabel("Date")
102
+ plt.ylabel("Messages")
103
+ st.pyplot(plt)
104
+ plt.clf()
105
+
106
+ st.title("Daily Timeline")
107
+ daily_timeline = helper.daily_timeline(selected_user, df_filtered.sample(min(5000, len(df_filtered))))
108
+ if not daily_timeline.empty:
109
+ plt.figure(figsize=(10, 5))
110
+ sns.lineplot(data=daily_timeline, x='date', y='message', color='black')
111
+ plt.title("Daily Timeline")
112
+ plt.xlabel("Date")
113
+ plt.ylabel("Messages")
114
+ st.pyplot(plt)
115
+ plt.clf()
116
+
117
+ st.title("Activity Map")
118
+ col1, col2 = st.columns(2)
119
+ with col1:
120
+ st.header("Most Busy Day")
121
+ busy_day = helper.week_activity_map(selected_user, df_filtered)
122
+ if not busy_day.empty:
123
+ plt.figure(figsize=(10, 5))
124
+ sns.barplot(x=busy_day.index, y=busy_day.values, palette="Purples_r")
125
+ plt.title("Most Busy Day")
126
+ plt.xlabel("Day of Week")
127
+ plt.ylabel("Message Count")
128
+ st.pyplot(plt)
129
+ plt.clf()
130
+ with col2:
131
+ st.header("Most Busy Month")
132
+ busy_month = helper.month_activity_map(selected_user, df_filtered)
133
+ if not busy_month.empty:
134
+ plt.figure(figsize=(10, 5))
135
+ sns.barplot(x=busy_month.index, y=busy_month.values, palette="Oranges_r")
136
+ plt.title("Most Busy Month")
137
+ plt.xlabel("Month")
138
+ plt.ylabel("Message Count")
139
+ st.pyplot(plt)
140
+ plt.clf()
141
+
142
+ if selected_user == 'Overall':
143
+ st.title("Most Busy Users")
144
+ x, new_df = helper.most_busy_users(df_filtered)
145
+ if not x.empty:
146
+ plt.figure(figsize=(10, 5))
147
+ sns.barplot(x=x.index, y=x.values, palette="Reds_r")
148
+ plt.title("Most Busy Users")
149
+ plt.xlabel("User")
150
+ plt.ylabel("Message Count")
151
+ plt.xticks(rotation=45)
152
+ st.pyplot(plt)
153
+ st.title("Word Count by User")
154
+ plt.clf()
155
+ st.dataframe(new_df)
156
 
157
+ # Most common words analysis
158
+ st.title("Most Common Words")
159
+ most_common_df = helper.most_common_words(selected_user, df_filtered)
160
+ if not most_common_df.empty:
161
+ fig, ax = plt.subplots(figsize=(10, 6))
162
+ sns.barplot(y=most_common_df[0], x=most_common_df[1], ax=ax, palette="Blues_r")
163
+ ax.set_title("Top 20 Most Common Words")
164
+ ax.set_xlabel("Frequency")
165
+ ax.set_ylabel("Words")
166
+ plt.xticks(rotation='vertical')
167
+ st.pyplot(fig)
168
+ plt.clf()
169
+ else:
170
+ st.warning("No data available for most common words.")
171
+
172
+ # Emoji analysis
173
+ st.title("Emoji Analysis")
174
+ emoji_df = helper.emoji_helper(selected_user, df_filtered)
175
+ if not emoji_df.empty:
176
+ col1, col2 = st.columns(2)
177
+
178
+ with col1:
179
+ st.subheader("Top Emojis Used")
180
+ st.dataframe(emoji_df)
181
+
182
+ with col2:
183
+ fig, ax = plt.subplots(figsize=(8, 8))
184
+ ax.pie(emoji_df[1].head(), labels=emoji_df[0].head(),
185
+ autopct="%0.2f%%", startangle=90,
186
+ colors=sns.color_palette("pastel"))
187
+ ax.set_title("Top Emoji Distribution")
188
+ st.pyplot(fig)
189
+ plt.clf()
190
+ else:
191
+ st.warning("No data available for emoji analysis.")
192
 
193
+ # Sentiment Analysis Visualizations
194
+ st.title("πŸ“ˆ Sentiment Analysis")
 
 
195
 
196
+ # Convert month names to abbreviated format
197
+ month_map = {
198
+ 'January': 'Jan', 'February': 'Feb', 'March': 'Mar', 'April': 'Apr',
199
+ 'May': 'May', 'June': 'Jun', 'July': 'Jul', 'August': 'Aug',
200
+ 'September': 'Sep', 'October': 'Oct', 'November': 'Nov', 'December': 'Dec'
201
+ }
202
+ df_filtered['month'] = df_filtered['month'].map(month_map)
203
+
204
+ # Group by month and sentiment
205
+ monthly_sentiment = df_filtered.groupby(['month', 'sentiment']).size().unstack(fill_value=0)
206
+
207
+ # Plotting: Histogram (Bar Chart) for each sentiment
208
+ st.write("### Sentiment Count by Month (Histogram)")
209
+
210
+ # Create a figure with subplots for each sentiment
211
+ fig, axes = plt.subplots(1, 3, figsize=(18, 5))
212
+
213
+ # Plot Positive Sentiment
214
+ if 'positive' in monthly_sentiment:
215
+ axes[0].bar(monthly_sentiment.index, monthly_sentiment['positive'], color='green')
216
+ axes[0].set_title('Positive Sentiment')
217
+ axes[0].set_xlabel('Month')
218
+ axes[0].set_ylabel('Count')
219
+
220
+ # Plot Neutral Sentiment
221
+ if 'neutral' in monthly_sentiment:
222
+ axes[1].bar(monthly_sentiment.index, monthly_sentiment['neutral'], color='blue')
223
+ axes[1].set_title('Neutral Sentiment')
224
+ axes[1].set_xlabel('Month')
225
+ axes[1].set_ylabel('Count')
226
+
227
+ # Plot Negative Sentiment
228
+ if 'negative' in monthly_sentiment:
229
+ axes[2].bar(monthly_sentiment.index, monthly_sentiment['negative'], color='red')
230
+ axes[2].set_title('Negative Sentiment')
231
+ axes[2].set_xlabel('Month')
232
+ axes[2].set_ylabel('Count')
233
+
234
+ # Display the plots in Streamlit
235
  st.pyplot(fig)
236
+ plt.clf()
237
+
238
+ # Count sentiments per day of the week
239
+ sentiment_counts = df_filtered.groupby(['day_of_week', 'sentiment']).size().unstack(fill_value=0)
240
+
241
+ # Sort days correctly
242
+ day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
243
+ sentiment_counts = sentiment_counts.reindex(day_order)
244
+
245
+ # Daily Sentiment Analysis
246
+ st.write("### Daily Sentiment Analysis")
247
+
248
+ # Create a Matplotlib figure
249
+ fig, ax = plt.subplots(figsize=(10, 5))
250
+ sentiment_counts.plot(kind='bar', stacked=False, ax=ax, color=['red', 'blue', 'green'])
251
+
252
+ # Customize the plot
253
+ ax.set_xlabel("Day of the Week")
254
+ ax.set_ylabel("Count")
255
+ ax.set_title("Sentiment Distribution per Day of the Week")
256
+ ax.legend(title="Sentiment")
257
+
258
+ # Display the plot in Streamlit
259
+ st.pyplot(fig)
260
+ plt.clf()
261
+
262
+ # Count messages per user per sentiment (only for Overall view)
263
+ if selected_user == 'Overall':
264
+ sentiment_counts = df_filtered.groupby(['user', 'sentiment']).size().reset_index(name='Count')
265
+
266
+ # Calculate total messages per sentiment
267
+ total_per_sentiment = df_filtered['sentiment'].value_counts().to_dict()
268
+
269
+ # Add percentage column
270
+ sentiment_counts['Percentage'] = sentiment_counts.apply(
271
+ lambda row: (row['Count'] / total_per_sentiment[row['sentiment']]) * 100, axis=1
272
+ )
273
+
274
+ # Separate tables for each sentiment
275
+ positive_df = sentiment_counts[sentiment_counts['sentiment'] == 'positive'].sort_values(by='Count', ascending=False).head(10)
276
+ neutral_df = sentiment_counts[sentiment_counts['sentiment'] == 'neutral'].sort_values(by='Count', ascending=False).head(10)
277
+ negative_df = sentiment_counts[sentiment_counts['sentiment'] == 'negative'].sort_values(by='Count', ascending=False).head(10)
278
+
279
+ # Sentiment Contribution Analysis
280
+ st.write("### Sentiment Contribution by User")
281
+
282
+ # Create three columns for side-by-side display
283
+ col1, col2, col3 = st.columns(3)
284
+
285
+ # Display Positive Table
286
+ with col1:
287
+ st.subheader("Top Positive Contributors")
288
+ if not positive_df.empty:
289
+ st.dataframe(positive_df[['user', 'Count', 'Percentage']])
290
+ else:
291
+ st.warning("No positive sentiment data")
292
+
293
+ # Display Neutral Table
294
+ with col2:
295
+ st.subheader("Top Neutral Contributors")
296
+ if not neutral_df.empty:
297
+ st.dataframe(neutral_df[['user', 'Count', 'Percentage']])
298
+ else:
299
+ st.warning("No neutral sentiment data")
300
+
301
+ # Display Negative Table
302
+ with col3:
303
+ st.subheader("Top Negative Contributors")
304
+ if not negative_df.empty:
305
+ st.dataframe(negative_df[['user', 'Count', 'Percentage']])
306
+ else:
307
+ st.warning("No negative sentiment data")
308
+
309
+ # Topic Analysis Section
310
+ st.title("πŸ” Area of Focus: Topic Analysis")
311
+
312
+ # Check if topic column exists, otherwise perform topic modeling
313
+ # if 'topic' not in df_filtered.columns:
314
+ # with st.spinner("Performing topic modeling..."):
315
+ # try:
316
+ # # Add topic modeling here or ensure your helper functions handle it
317
+ # df_filtered = helper.perform_topic_modeling(df_filtered)
318
+ # except Exception as e:
319
+ # st.error(f"Topic modeling failed: {str(e)}")
320
+ # st.stop()
321
 
322
+ # Plot Topic Distribution
323
+ st.header("Topic Distribution")
324
+ try:
325
+ fig = helper.plot_topic_distribution(df_filtered)
326
+ st.pyplot(fig)
327
+ plt.clf()
328
+ except Exception as e:
329
+ st.warning(f"Could not display topic distribution: {str(e)}")
330
+
331
  # Display Sample Messages for Each Topic
332
  st.header("Sample Messages for Each Topic")
333
+ if 'topic' in df_filtered.columns:
334
+ for topic_id in sorted(df_filtered['topic'].unique()):
335
+ st.subheader(f"Topic {topic_id}")
336
+
337
+ # Get messages for the current topic
338
+ filtered_messages = df_filtered[df_filtered['topic'] == topic_id]['message']
339
+
340
+ # Determine sample size
341
+ sample_size = min(5, len(filtered_messages))
342
+
343
+ if sample_size > 0:
344
+ sample_messages = filtered_messages.sample(sample_size, replace=False).tolist()
345
+ for msg in sample_messages:
346
+ st.write(f"- {msg}")
347
+ else:
348
+ st.write("No messages available for this topic.")
349
+ else:
350
+ st.warning("Topic information not available")
351
+
352
+ # Topic Distribution Over Time
353
+ st.header("πŸ“… Topic Trends Over Time")
354
+
355
+ # Add time frequency selector
356
+ time_freq = st.selectbox("Select Time Frequency", ["Daily", "Weekly", "Monthly"], key='time_freq')
357
+
358
+ # Plot topic trends
359
+ try:
360
+ freq_map = {"Daily": "D", "Weekly": "W", "Monthly": "M"}
361
+ topic_distribution = helper.topic_distribution_over_time(df_filtered, time_freq=freq_map[time_freq])
362
+
363
+ # Choose between static and interactive plot
364
+ use_plotly = st.checkbox("Use interactive visualization", value=True, key='use_plotly')
365
+
366
+ if use_plotly:
367
+ fig = helper.plot_topic_distribution_over_time_plotly(topic_distribution)
368
+ st.plotly_chart(fig, use_container_width=True)
369
+ else:
370
+ fig = helper.plot_topic_distribution_over_time(topic_distribution)
371
+ st.pyplot(fig)
372
+ plt.clf()
373
+ except Exception as e:
374
+ st.warning(f"Could not display topic trends: {str(e)}")
375
+
376
+ # Clustering Analysis Section
377
+ st.title("🧩 Conversation Clusters")
378
+
379
+ # Number of clusters input
380
+ n_clusters = st.slider("Select number of clusters",
381
+ min_value=2,
382
+ max_value=10,
383
+ value=5,
384
+ key='n_clusters')
385
 
386
+ # Perform clustering
387
+ with st.spinner("Analyzing conversation clusters..."):
388
+ try:
389
+ df_clustered, reduced_features, _ = preprocessor.preprocess_for_clustering(df_filtered, n_clusters=n_clusters)
390
+
391
+ # Plot clusters
392
+ st.header("Cluster Visualization")
393
+ fig = helper.plot_clusters(reduced_features, df_clustered['cluster'])
394
+ st.pyplot(fig)
395
+ plt.clf()
396
+
397
+ # Cluster Insights
398
+ st.header("📌 Cluster Insights")
399
+
400
+ # 1. Dominant Conversation Themes
401
+ st.subheader("1. Dominant Themes")
402
+ cluster_labels = helper.get_cluster_labels(df_clustered, n_clusters)
403
+ for cluster_id, label in cluster_labels.items():
404
+ st.write(f"**Cluster {cluster_id}**: {label}")
405
+
406
+ # 2. Temporal Patterns
407
+ st.subheader("2. Temporal Patterns")
408
+ temporal_trends = helper.get_temporal_trends(df_clustered)
409
+ for cluster_id, trend in temporal_trends.items():
410
+ st.write(f"**Cluster {cluster_id}**: Peaks on {trend['peak_day']} around {trend['peak_time']}")
411
+
412
+ # 3. User Contributions
413
+ if selected_user == 'Overall':
414
+ st.subheader("3. Top Contributors")
415
+ user_contributions = helper.get_user_contributions(df_clustered)
416
+ for cluster_id, users in user_contributions.items():
417
+ st.write(f"**Cluster {cluster_id}**: {', '.join(users[:3])}...")
418
+
419
+ # 4. Sentiment by Cluster
420
+ st.subheader("4. Sentiment Analysis")
421
+ sentiment_by_cluster = helper.get_sentiment_by_cluster(df_clustered)
422
+ for cluster_id, sentiment in sentiment_by_cluster.items():
423
+ st.write(f"**Cluster {cluster_id}**: {sentiment['positive']}% positive, {sentiment['neutral']}% neutral, {sentiment['negative']}% negative")
424
+
425
+ # Sample messages from each cluster
426
+ st.subheader("Sample Messages")
427
+ for cluster_id in sorted(df_clustered['cluster'].unique()):
428
+ with st.expander(f"Cluster {cluster_id} Messages"):
429
+ cluster_msgs = df_clustered[df_clustered['cluster'] == cluster_id]['message']
430
+ sample_size = min(3, len(cluster_msgs))
431
+ if sample_size > 0:
432
+ for msg in cluster_msgs.sample(sample_size, replace=False):
433
+ st.write(f"- {msg}")
434
+ else:
435
+ st.write("No messages available")
436
+
437
+ except Exception as e:
438
+ st.error(f"Clustering failed: {str(e)}")
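The sample-size guard used in the topic and cluster panels above (capping at the number of available messages before calling `Series.sample`) can be sketched in isolation. `sample_messages` is a hypothetical helper for illustration, not a function from app.py:

```python
import pandas as pd

def sample_messages(messages: pd.Series, k: int = 5) -> list:
    """Return up to k messages without replacement; empty list when none exist."""
    n = min(k, len(messages))
    if n == 0:
        return []
    return messages.sample(n, replace=False, random_state=0).tolist()

print(sample_messages(pd.Series(["hi", "hello", "hey"])))  # three messages, shuffled
print(sample_messages(pd.Series([], dtype=str)))           # []
```

Without the `min(...)` cap, `Series.sample(5)` raises a `ValueError` whenever fewer than five messages match the selected topic or cluster.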
debug_regex.py DELETED
@@ -1,23 +0,0 @@
1
- import pandas as pd
2
- import re
3
-
4
- pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"
5
-
6
- lines = [
7
- "12/12/23, 10:00 - User1: Hello",
8
- "1/1/23, 1:00 - User2: Hi",
9
- "10/10/2023, 10:00 PM - User3: Test",
10
- "12/12/23, 10:00 - System Message"
11
- ]
12
-
13
- df = pd.DataFrame({'line': lines})
14
- extracted = df['line'].str.extract(pattern)
15
- print("Extracted DataFrame:")
16
- print(extracted)
17
-
18
- print("\nRegex Match Check:")
19
- for line in lines:
20
- match = re.match(pattern, line)
21
- print(f"'{line}' -> Match: {bool(match)}")
22
- if match:
23
- print(match.groupdict())
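The deleted script checks the pattern with `re.match`, while the vectorized parser in preprocessor.py relies on `Series.str.extract` with the same pattern. The two paths agree on the named groups (pattern copied verbatim from the file above):

```python
import re
import pandas as pd

pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"

line = "12/12/23, 10:00 - User1: Hello"
row = pd.Series([line]).str.extract(pattern).iloc[0]
match = re.match(pattern, line)

# str.extract builds one column per named group, so both paths agree.
assert match is not None
assert row.to_dict() == match.groupdict()
print(row.to_dict())
```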
finetune.py DELETED
@@ -1,102 +0,0 @@
1
- # Ensure you've run: pip install transformers datasets torch numpy tf-keras
2
- # PyTorch should already be installed (2.4.0 CPU version is fine)
3
-
4
- import torch
5
- from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, TextClassificationPipeline
6
- from datasets import load_dataset
7
- import numpy as np
8
-
9
- # Check device: Use MPS if available (Apple Silicon), else CPU
10
- device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
11
- print(f"Using device: {device}")
12
-
13
- # Step 1: Load the pre-trained model and tokenizer
14
- model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
15
- tokenizer = AutoTokenizer.from_pretrained(model_name)
16
- model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
17
-
18
- # Step 2: Load and prepare the tweet_eval sentiment dataset
19
- dataset = load_dataset("tweet_eval", "sentiment")
20
-
21
- # Remap labels: tweet_eval (0=negative, 1=neutral, 2=positive) to our model (0=positive, 1=neutral, 2=negative)
22
- def remap_labels(example):
23
- label_map = {0: 2, 1: 1, 2: 0} # Negative->2, Neutral->1, Positive->0
24
- example["label"] = label_map[example["label"]]
25
- return example
26
-
27
- dataset = dataset.map(remap_labels)
28
-
29
- # Tokenize the dataset
30
- def tokenize_function(examples):
31
- return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
32
-
33
- tokenized_dataset = dataset.map(tokenize_function, batched=True)
34
- tokenized_dataset = tokenized_dataset.remove_columns(["text"])
35
- tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
36
- tokenized_dataset.set_format("torch")
37
-
38
- # Split into train and eval datasets
39
- train_dataset = tokenized_dataset["train"] # ~45,580 examples
40
- eval_dataset = tokenized_dataset["test"] # ~12,000 examples
41
-
42
- # Step 3: Define a function to compute accuracy
43
- def compute_metrics(eval_pred):
44
- logits, labels = eval_pred
45
- predictions = np.argmax(logits, axis=-1)
46
- accuracy = (predictions == labels).mean()
47
- return {"accuracy": accuracy}
48
-
49
- # Step 4: Set up training arguments
50
- training_args = TrainingArguments(
51
- output_dir="./fine-tuned-sentiment-large",
52
- num_train_epochs=3,
53
- per_device_train_batch_size=4, # Reduced for 8GB RAM
54
- per_device_eval_batch_size=4, # Reduced for 8GB RAM
55
- warmup_steps=500,
56
- weight_decay=0.01,
57
- logging_dir="./logs",
58
- logging_steps=100,
59
- eval_strategy="epoch", # Updated from evaluation_strategy
60
- save_strategy="epoch",
61
- learning_rate=2e-5,
62
- fp16=False, # Disabled (not supported on MPS)
63
- # Use MPS acceleration if available
64
- no_cuda=True, # Force no CUDA since M2 doesn't support it
65
- # torch.backends.mps.is_available() check is handled by device selection
66
- )
67
-
68
- # Step 5: Initialize and train the model
69
- trainer = Trainer(
70
- model=model,
71
- args=training_args,
72
- train_dataset=train_dataset,
73
- eval_dataset=eval_dataset,
74
- compute_metrics=compute_metrics,
75
- )
76
-
77
- print("Starting training...")
78
- trainer.train()
79
-
80
- # Step 6: Save the fine-tuned model
81
- model.save_pretrained("./fine-tuned-sentiment-large")
82
- tokenizer.save_pretrained("./fine-tuned-sentiment-large")
83
- print("Model saved to ./fine-tuned-sentiment-large")
84
-
85
- # Step 7: Evaluate the model on the test set
86
- eval_results = trainer.evaluate()
87
- print(f"Evaluation results: {eval_results}")
88
-
89
- # Step 8: Test on your specific examples
90
- classifier = TextClassificationPipeline(
91
- model=AutoModelForSequenceClassification.from_pretrained("./fine-tuned-sentiment-large").to(device),
92
- tokenizer=AutoTokenizer.from_pretrained("./fine-tuned-sentiment-large"),
93
- device=0 if device.type == "mps" else -1, # 0 for MPS, -1 for CPU
94
- return_all_scores=False
95
- )
96
-
97
- texts = ["Great service!", "It's okay, nothing special.", "Terrible experience."]
98
- results = classifier(texts)
99
-
100
- print("\nTesting on custom examples:")
101
- for text, result in zip(texts, results):
102
- print(f"Text: {text} -> Sentiment: {result['label']}")
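The label remapping in the deleted finetune.py is the step most easily inverted by mistake: tweet_eval encodes 0=negative/1=neutral/2=positive, while the student model's head uses 0=positive/1=neutral/2=negative. A standalone sketch of that mapping (plain dicts stand in for `datasets.map` rows):

```python
# tweet_eval labels:  0=negative, 1=neutral, 2=positive
# model head labels:  0=positive, 1=neutral, 2=negative
LABEL_MAP = {0: 2, 1: 1, 2: 0}

def remap_labels(example: dict) -> dict:
    """Rewrite the label in place, mirroring the function passed to dataset.map."""
    example["label"] = LABEL_MAP[example["label"]]
    return example

rows = [{"text": "awful", "label": 0}, {"text": "fine", "label": 1}, {"text": "great", "label": 2}]
print([remap_labels(dict(r))["label"] for r in rows])  # [2, 1, 0]
```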
helper.py CHANGED
@@ -3,243 +3,130 @@ from wordcloud import WordCloud
3
  import pandas as pd
4
  from collections import Counter
5
  import emoji
 
6
  import matplotlib.pyplot as plt
7
  import seaborn as sns
8
- import plotly.express as px
9
- import numpy as np
10
- from sklearn.feature_extraction.text import TfidfVectorizer
11
- from openrouter_chat import generate_title_from_messages
12
 
13
  extract = URLExtract()
14
 
15
- def fetch_stats(selected_user,df):
16
-
17
  if selected_user != 'Overall':
18
  df = df[df['user'] == selected_user]
19
-
20
- # fetch the number of messages
21
  num_messages = df.shape[0]
22
-
23
- # fetch the total number of words
24
- words = []
25
- for message in df['message']:
26
- words.extend(message.split())
27
-
28
- # fetch number of media messages
29
- num_media_messages = df[df['unfiltered_messages'].str.contains('<media omitted>', case=False, na=False)].shape[0]
30
-
31
- # fetch number of links shared
32
- links = []
33
- for message in df['unfiltered_messages']:
34
- links.extend(extract.find_urls(message))
35
-
36
- return num_messages,len(words),num_media_messages,len(links)
37
 
38
  def most_busy_users(df):
39
  x = df['user'].value_counts().head()
40
  df = round((df['user'].value_counts() / df.shape[0]) * 100, 2).reset_index().rename(
41
  columns={'index': 'percentage', 'user': 'Name'})
42
- return x,df
43
 
44
  def create_wordcloud(selected_user, df):
45
- # f = open('stop_hinglish.txt', 'r')
46
- stop_words = df
47
-
48
  if selected_user != 'Overall':
49
  df = df[df['user'] == selected_user]
50
-
51
  temp = df[df['user'] != 'group_notification']
52
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
53
-
54
- def remove_stop_words(message):
55
- y = []
56
- for word in message.lower().split():
57
- if word not in stop_words:
58
- y.append(word)
59
- return " ".join(y)
60
-
61
  wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')
62
- temp['message'] = temp['message'].apply(remove_stop_words)
63
  df_wc = wc.generate(temp['message'].str.cat(sep=" "))
64
  return df_wc
65
 
66
  def most_common_words(selected_user, df):
67
- # f = open('stop_hinglish.txt','r')
68
- stop_words = df
69
-
70
  if selected_user != 'Overall':
71
  df = df[df['user'] == selected_user]
72
-
73
  temp = df[df['user'] != 'group_notification']
74
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
75
-
76
- words = []
77
-
78
- for message in temp['message']:
79
- for word in message.lower().split():
80
- if word not in stop_words:
81
- words.append(word)
82
-
83
- most_common_df = pd.DataFrame(Counter(words).most_common(20))
84
- return most_common_df
85
 
86
  def emoji_helper(selected_user, df):
87
  if selected_user != 'Overall':
88
  df = df[df['user'] == selected_user]
 
 
89
 
90
- emojis = []
91
- for message in df['unfiltered_messages']:
92
- emojis.extend([c for c in message if c in emoji.EMOJI_DATA])
93
-
94
- emoji_df = pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
95
-
96
- if emoji_df.empty:
97
- return pd.DataFrame(columns=[0, 1])
98
-
99
- return emoji_df
100
-
101
-
102
- def monthly_timeline(selected_user,df):
103
-
104
  if selected_user != 'Overall':
105
  df = df[df['user'] == selected_user]
106
-
107
- timeline = df.groupby(['year','month']).count()['message'].reset_index()
108
-
109
- time = []
110
- for i in range(timeline.shape[0]):
111
- time.append(timeline['month'][i] + "-" + str(timeline['year'][i]))
112
-
113
- timeline['time'] = time
114
-
115
  return timeline
116
 
117
- def daily_timeline(selected_user,df):
118
-
119
  if selected_user != 'Overall':
120
  df = df[df['user'] == selected_user]
 
121
 
122
- daily_timeline = df.groupby('date').count()['message'].reset_index()
123
-
124
- return daily_timeline
125
-
126
- def week_activity_map(selected_user,df):
127
-
128
  if selected_user != 'Overall':
129
  df = df[df['user'] == selected_user]
 
130
 
131
- return df['day'].value_counts()
132
-
133
- def month_activity_map(selected_user,df):
134
-
135
  if selected_user != 'Overall':
136
  df = df[df['user'] == selected_user]
137
-
138
  return df['month'].value_counts()
139
 
140
- def activity_heatmap(selected_user,df):
 
 
 
141
 
142
  if selected_user != 'Overall':
143
  df = df[df['user'] == selected_user]
144
 
145
- user_heatmap = df.pivot_table(index='day', columns='period', values='message', aggfunc='count').fillna(0)
 
146
 
147
- return user_heatmap
148
 
149
- def generate_wordcloud(text, color):
150
- wordcloud = WordCloud(width=400, height=300, background_color=color, colormap="viridis").generate(text)
151
- return wordcloud
 
152
 
153
- def create_heuristic_title(topic, idx):
154
- return f"Topic {idx + 1}: {', '.join(topic[:3])}"
155
 
156
- def generate_topic_titles_from_messages(topic_messages_map):
157
- """
158
- Generate titles for topics using OpenRouter AI based on message content.
159
-
160
- Args:
161
- topic_messages_map (dict): key=topic_id, value=list of message strings
162
-
163
- Returns:
164
- dict: key=topic_id, value=generated title
165
- """
166
- titles = {}
167
- print("Generating topic titles using OpenRouter...")
168
- for topic_id, messages in topic_messages_map.items():
169
- try:
170
- # Generate title from sample messages
171
- title = generate_title_from_messages(messages)
172
- titles[topic_id] = title
173
- print(f"Topic {topic_id}: {title}\n\n\n\n{messages}")
174
- except Exception as e:
175
- print(f"Failed to generate title for topic {topic_id}: {e}")
176
- titles[topic_id] = f"Topic {topic_id}"
177
-
178
- return titles
179
-
180
- def create_basic_titles(topics):
181
- """Fallback to keyword-based titles if AI fails or is unused."""
182
- titles = []
183
- for idx, topic_words in enumerate(topics):
184
- if isinstance(topic_words, list) and len(topic_words) >= 3:
185
- title = f"Topic {idx}: {', '.join(topic_words[:3])}"
186
- else:
187
- title = f"Topic {idx}"
188
- titles.append(title)
189
- return titles
190
 
191
- def plot_topics(topics, use_ai=True, **kwargs):
192
- """
193
- Plots a bar chart for the top words in each topic.
194
-
195
- Args:
196
- topics: List of topics (lists of top words)
197
- custom_titles: Optional list or dict of titles to use instead of generating them
198
-
199
- Returns:
200
- matplotlib.figure.Figure: The plot figure
201
- """
202
- if not topics or not isinstance(topics[0], list):
203
- raise ValueError("topics must be a list of lists of words.")
204
-
205
- # Determine titles
206
- custom_titles = kwargs.get('custom_titles')
207
- if custom_titles:
208
- # If it's a dict, convert to list based on index
209
- if isinstance(custom_titles, dict):
210
- titles = [custom_titles.get(i, f"Topic {i}") for i in range(len(topics))]
211
- else:
212
- titles = custom_titles
213
- else:
214
- # Fallback to basic keyword-based titles
215
- titles = create_basic_titles(topics)
216
-
217
- fig, axes = plt.subplots(1, len(topics), figsize=(20, 10))
218
- if len(topics) == 1:
219
- axes = [axes] # Ensure axes is iterable for single topic
220
-
221
- for idx, topic in enumerate(topics):
222
- if not isinstance(topic, list):
223
- raise ValueError(f"Topic {idx} is not a list of words.")
224
-
225
- top_words = topic[:10] # Show top 10 words
226
- axes[idx].barh(range(len(top_words)), range(len(top_words)))
227
- axes[idx].set_yticks(range(len(top_words)))
228
- axes[idx].set_yticklabels(top_words)
229
- axes[idx].set_title(titles[idx], fontsize=14, fontweight='bold')
230
- axes[idx].set_xlabel("Word Importance")
231
- axes[idx].set_ylabel("Top Words")
232
 
233
- plt.tight_layout()
234
- return fig
235
 
 
236
  def plot_topic_distribution(df):
237
  """
238
  Plots the distribution of topics in the chat data.
239
  """
240
  topic_counts = df['topic'].value_counts().sort_index()
241
  fig, ax = plt.subplots()
242
- sns.barplot(x=topic_counts.index, y=topic_counts.values, ax=ax, palette="viridis", hue=topic_counts.index, legend=False)
243
  ax.set_title("Topic Distribution")
244
  ax.set_xlabel("Topic")
245
  ax.set_ylabel("Number of Messages")
@@ -252,16 +139,6 @@ def most_frequent_keywords(messages, top_n=10):
252
  words = [word for msg in messages for word in msg.split()]
253
  word_freq = Counter(words)
254
  return word_freq.most_common(top_n)
255
-
256
- def topic_distribution_over_time(df, time_freq='M'):
257
- """
258
- Analyzes the distribution of topics over time.
259
- """
260
- # Group by time interval and topic
261
- df['time_period'] = df['date'].dt.to_period(time_freq)
262
- topic_distribution = df.groupby(['time_period', 'topic']).size().unstack(fill_value=0)
263
- return topic_distribution
264
-
265
  def plot_topic_distribution_over_time(topic_distribution):
266
  """
267
  Plots the distribution of topics over time using a line chart.
@@ -286,11 +163,37 @@ def plot_most_frequent_keywords(keywords):
286
  """
287
  words, counts = zip(*keywords)
288
  fig, ax = plt.subplots()
289
- sns.barplot(x=list(counts), y=list(words), ax=ax, palette="viridis", hue=list(words), legend=False)
290
  ax.set_title("Most Frequent Keywords")
291
  ax.set_xlabel("Frequency")
292
  ax.set_ylabel("Keyword")
293
  return fig
 
 
294
 
295
  def plot_topic_distribution_over_time_plotly(topic_distribution):
296
  """
@@ -304,7 +207,6 @@ def plot_topic_distribution_over_time_plotly(topic_distribution):
304
  title="Topic Distribution Over Time", labels={'time_period': 'Time Period', 'count': 'Number of Messages'})
305
  fig.update_layout(legend_title_text='Topics', xaxis_tickangle=-45)
306
  return fig
307
-
308
  def plot_clusters(reduced_features, clusters):
309
  """
310
  Visualize clusters using t-SNE.
@@ -327,25 +229,19 @@ def plot_clusters(reduced_features, clusters):
327
  plt.ylabel("t-SNE Component 2")
328
  plt.tight_layout()
329
  return plt.gcf()
330
-
331
- def remove_emojis(text):
332
- """Removes emojis from text to prevent matplotlib warnings."""
333
- return text.encode('ascii', 'ignore').decode('ascii')
334
-
335
  def get_cluster_labels(df, n_clusters):
336
  """
337
  Generate descriptive labels for each cluster based on top keywords.
338
  """
 
 
 
339
  vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
340
  tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
341
 
342
  cluster_labels = {}
343
- # Reset index to ensure alignment with tfidf_matrix
344
- df_reset = df.reset_index(drop=True)
345
-
346
  for cluster_id in range(n_clusters):
347
- # Get indices where cluster matches
348
- cluster_indices = df_reset[df_reset['cluster'] == cluster_id].index
349
  if len(cluster_indices) > 0:
350
  cluster_tfidf = tfidf_matrix[cluster_indices]
351
  top_keywords = np.argsort(cluster_tfidf.sum(axis=0).A1)[-3:][::-1]
 
3
  import pandas as pd
4
  from collections import Counter
5
  import emoji
6
+ import plotly.express as px
7
  import matplotlib.pyplot as plt
8
  import seaborn as sns
 
 
 
 
9
 
10
  extract = URLExtract()
11
 
12
+ def fetch_stats(selected_user, df):
 
13
  if selected_user != 'Overall':
14
  df = df[df['user'] == selected_user]
 
 
15
  num_messages = df.shape[0]
16
+ words = sum(len(msg.split()) for msg in df['message'])
17
+ num_media_messages = df[df['unfiltered_messages'] == '<media omitted>\n'].shape[0]
18
+ links = sum(len(extract.find_urls(msg)) for msg in df['unfiltered_messages'])
19
+ return num_messages, words, num_media_messages, links
 
 
 
 
 
 
 
 
 
 
 
20
 
21
  def most_busy_users(df):
22
  x = df['user'].value_counts().head()
23
  df = round((df['user'].value_counts() / df.shape[0]) * 100, 2).reset_index().rename(
24
  columns={'index': 'percentage', 'user': 'Name'})
25
+ return x, df
26
 
27
  def create_wordcloud(selected_user, df):
 
 
 
28
  if selected_user != 'Overall':
29
  df = df[df['user'] == selected_user]
 
30
  temp = df[df['user'] != 'group_notification']
31
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
 
 
 
 
 
 
 
 
32
  wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')
 
33
  df_wc = wc.generate(temp['message'].str.cat(sep=" "))
34
  return df_wc
35
 
36
  def most_common_words(selected_user, df):
 
 
 
37
  if selected_user != 'Overall':
38
  df = df[df['user'] == selected_user]
 
39
  temp = df[df['user'] != 'group_notification']
40
  temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
41
+ words = [word for msg in temp['message'] for word in msg.lower().split()]
42
+ return pd.DataFrame(Counter(words).most_common(20))
 
 
 
 
 
 
 
 
43
 
44
  def emoji_helper(selected_user, df):
45
  if selected_user != 'Overall':
46
  df = df[df['user'] == selected_user]
47
+ emojis = [c for msg in df['unfiltered_messages'] for c in msg if c in emoji.EMOJI_DATA]
48
+ return pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
49
 
50
+ def monthly_timeline(selected_user, df):
 
 
 
 
 
 
 
 
 
 
 
 
 
51
  if selected_user != 'Overall':
52
  df = df[df['user'] == selected_user]
53
+ timeline = df.groupby(['year', 'month']).count()['message'].reset_index()
54
+ timeline['time'] = timeline['month'] + "-" + timeline['year'].astype(str)
 
 
 
 
 
 
 
55
  return timeline
56
 
57
+ def daily_timeline(selected_user, df):
 
58
  if selected_user != 'Overall':
59
  df = df[df['user'] == selected_user]
60
+ return df.groupby('date').count()['message'].reset_index()
61
 
62
+ def week_activity_map(selected_user, df):
 
 
 
 
 
63
  if selected_user != 'Overall':
64
  df = df[df['user'] == selected_user]
65
+ return df['day_of_week'].value_counts()
66
 
67
+ def month_activity_map(selected_user, df):
 
 
 
68
  if selected_user != 'Overall':
69
  df = df[df['user'] == selected_user]
 
70
  return df['month'].value_counts()
71
 
72
+ def plot_topic_distribution(df):
73
+ topic_counts = df['topic'].value_counts().sort_index()
74
+ fig = px.bar(x=topic_counts.index, y=topic_counts.values, title="Topic Distribution", color_discrete_sequence=px.colors.sequential.Viridis)
75
+ return fig
76
+
77
+ def topic_distribution_over_time(df, time_freq='M'):
78
+ df['time_period'] = df['date'].dt.to_period(time_freq)
79
+ return df.groupby(['time_period', 'topic']).size().unstack(fill_value=0)
80
+
81
+ def plot_topic_distribution_over_time_plotly(topic_distribution):
82
+ topic_distribution = topic_distribution.reset_index()
83
+ topic_distribution['time_period'] = topic_distribution['time_period'].dt.to_timestamp()
84
+ topic_distribution = topic_distribution.melt(id_vars='time_period', var_name='topic', value_name='count')
85
+ fig = px.line(topic_distribution, x='time_period', y='count', color='topic', title="Topic Distribution Over Time")
86
+ fig.update_layout(legend_title_text='Topics', xaxis_tickangle=-45)
87
+ return fig
88
+
89
+ def plot_clusters(reduced_features, clusters):
90
+ fig = px.scatter(x=reduced_features[:, 0], y=reduced_features[:, 1], color=clusters, title="Message Clusters (t-SNE)")
91
+ return fig
92
+ def most_common_words(selected_user, df):
93
+ # f = open('stop_hinglish.txt','r')
94
+ stop_words = df
95
 
96
  if selected_user != 'Overall':
97
  df = df[df['user'] == selected_user]
98
 
99
+ temp = df[df['user'] != 'group_notification']
100
+ temp = temp[~temp['message'].str.lower().str.contains('<media omitted>')]
101
 
102
+ words = []
103
 
104
+ for message in temp['message']:
105
+ for word in message.lower().split():
106
+ if word not in stop_words:
107
+ words.append(word)
108
 
109
+ most_common_df = pd.DataFrame(Counter(words).most_common(20))
110
+ return most_common_df
111
 
112
+ def emoji_helper(selected_user, df):
113
+ if selected_user != 'Overall':
114
+ df = df[df['user'] == selected_user]
 
 
115
 
116
+ emojis = []
117
+ for message in df['unfiltered_messages']:
118
+ emojis.extend([c for c in message if c in emoji.EMOJI_DATA])
 
 
119
 
120
+ emoji_df = pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
 
121
 
122
+ return emoji_df
123
  def plot_topic_distribution(df):
124
  """
125
  Plots the distribution of topics in the chat data.
126
  """
127
  topic_counts = df['topic'].value_counts().sort_index()
128
  fig, ax = plt.subplots()
129
+ sns.barplot(x=topic_counts.index, y=topic_counts.values, ax=ax, palette="viridis")
130
  ax.set_title("Topic Distribution")
131
  ax.set_xlabel("Topic")
132
  ax.set_ylabel("Number of Messages")
 
139
  words = [word for msg in messages for word in msg.split()]
140
  word_freq = Counter(words)
141
  return word_freq.most_common(top_n)
 
 
 
 
 
 
 
 
 
 
142
  def plot_topic_distribution_over_time(topic_distribution):
143
  """
144
  Plots the distribution of topics over time using a line chart.
 
163
  """
164
  words, counts = zip(*keywords)
165
  fig, ax = plt.subplots()
166
+ sns.barplot(x=list(counts), y=list(words), ax=ax, palette="viridis")
167
  ax.set_title("Most Frequent Keywords")
168
  ax.set_xlabel("Frequency")
169
  ax.set_ylabel("Keyword")
170
  return fig
171
+ def topic_distribution_over_time(df, time_freq='M'):
172
+ """
173
+ Analyzes the distribution of topics over time.
174
+ """
175
+ # Group by time interval and topic
176
+ df['time_period'] = df['date'].dt.to_period(time_freq)
177
+ topic_distribution = df.groupby(['time_period', 'topic']).size().unstack(fill_value=0)
178
+ return topic_distribution
179
+
180
+ def plot_topic_distribution_over_time(topic_distribution):
181
+ """
182
+ Plots the distribution of topics over time using a line chart.
183
+ """
184
+ fig, ax = plt.subplots(figsize=(12, 6))
185
+
186
+ # Plot each topic as a separate line
187
+ for topic in topic_distribution.columns:
188
+ ax.plot(topic_distribution.index.to_timestamp(), topic_distribution[topic], label=f"Topic {topic}")
189
+
190
+ ax.set_title("Topic Distribution Over Time")
191
+ ax.set_xlabel("Time Period")
192
+ ax.set_ylabel("Number of Messages")
193
+ ax.legend(title="Topics", bbox_to_anchor=(1.05, 1), loc='upper left')
194
+ plt.xticks(rotation=45)
195
+ plt.tight_layout()
196
+ return fig
197
 
198
  def plot_topic_distribution_over_time_plotly(topic_distribution):
199
  """
 
207
  title="Topic Distribution Over Time", labels={'time_period': 'Time Period', 'count': 'Number of Messages'})
208
  fig.update_layout(legend_title_text='Topics', xaxis_tickangle=-45)
209
  return fig
 
210
  def plot_clusters(reduced_features, clusters):
211
  """
212
  Visualize clusters using t-SNE.
 
229
  plt.ylabel("t-SNE Component 2")
230
  plt.tight_layout()
231
  return plt.gcf()
 
 
 
 
 
232
  def get_cluster_labels(df, n_clusters):
233
  """
234
  Generate descriptive labels for each cluster based on top keywords.
235
  """
236
+ from sklearn.feature_extraction.text import TfidfVectorizer
237
+ import numpy as np
238
+
239
  vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
240
  tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
241
 
242
  cluster_labels = {}
 
 
 
243
  for cluster_id in range(n_clusters):
244
+ cluster_indices = df[df['cluster'] == cluster_id].index
 
245
  if len(cluster_indices) > 0:
246
  cluster_tfidf = tfidf_matrix[cluster_indices]
247
  top_keywords = np.argsort(cluster_tfidf.sum(axis=0).A1)[-3:][::-1]
naive_bayes_model.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b4298e7bca558075d89911be8c06311d1276cfc414a445f6a5ed6561201985f1
3
- size 480879
openrouter_chat.py DELETED
@@ -1,91 +0,0 @@
1
- import os
2
- import requests
3
- from dotenv import load_dotenv
4
-
5
- # Load .env file
6
- load_dotenv()
7
-
8
- API_KEY = os.getenv("OPENROUTER_API_KEY")
9
-
10
- if not API_KEY:
11
- raise ValueError("OPENROUTER_API_KEY not found in .env file")
12
-
13
- API_URL = "https://openrouter.ai/api/v1/chat/completions"
14
-
15
- def get_chat_completion(messages, model="openai/gpt-4o-mini"):
16
- """
17
- Sends a list of messages to the OpenRouter API and returns the response content.
18
-
19
- Args:
20
- messages (list): A list of message dictionaries (e.g., [{"role": "user", "content": "..."}]).
21
- model (str): The model to use.
22
-
23
- Returns:
24
- str: The content of the AI's response.
25
- """
26
- headers = {
27
- "Authorization": f"Bearer {API_KEY}",
28
- "Content-Type": "application/json",
29
- "HTTP-Referer": "http://localhost",
30
- "X-Title": "WhatsApp Chat Analyzer"
31
- }
32
-
33
- payload = {
34
- "model": model,
35
- "messages": messages
36
- }
37
-
38
- try:
39
- response = requests.post(API_URL, headers=headers, json=payload)
40
- response.raise_for_status()
41
- data = response.json()
42
- return data["choices"][0]["message"]["content"]
43
- except Exception as e:
44
- print(f"Error calling OpenRouter API: {e}")
45
- return None
46
-
47
- def generate_title_from_messages(messages_list):
48
- """
49
- Generates a short, descriptive topic title based on a list of messages.
50
-
51
- Args:
52
- messages_list (list): A list of strings, where each string is a chat message.
53
-
54
- Returns:
55
- str: A generated title.
56
- """
57
- if not messages_list:
58
- return "Unknown Topic"
59
-
60
- # Limit to reasonable amount of text to avoid context limits or high costs
61
- # Join top messages with newlines
62
- context = "\n".join(messages_list[:10])
63
-
64
- prompt = (
65
- "Analyze the following WhatsApp chat messages and generate a SINGLE, short, descriptive title "
66
- "(max 5 words) that summarizes the conversation topic. Do not use quotes or prefixes like 'Topic:'. "
67
- "Just the title.\n\n"
68
- f"Messages:\n{context}"
69
- )
70
-
71
- messages = [
72
- {"role": "system", "content": "You are a helpful assistant that summarizes chat topics."},
73
- {"role": "user", "content": prompt}
74
- ]
75
-
76
- title = get_chat_completion(messages)
77
- print("Title:\n\n\n\n\n", title)
78
- return title.strip() if title else "General Discussion"
79
-
80
- if __name__ == "__main__":
81
- print("🤖 OpenRouter AI Chat (type 'exit' to quit)\n")
82
-
83
- while True:
84
- user_input = input("You: ")
85
- if user_input.lower() == "exit":
86
- break
87
-
88
- # Test the basic chat function
89
- msgs = [{"role": "user", "content": user_input}]
90
- reply = get_chat_completion(msgs)
91
- print("\nAI:", reply, "\n")
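`generate_title_from_messages` (deleted above) truncates to the first ten messages before prompting, which bounds token cost no matter how large a topic is. The prompt assembly can be separated from the API call; `build_title_prompt` is a hypothetical refactor, not a function from the file:

```python
def build_title_prompt(messages_list, limit=10):
    """Join up to `limit` messages into the title-generation prompt; None if empty."""
    if not messages_list:
        return None
    context = "\n".join(messages_list[:limit])
    return (
        "Analyze the following WhatsApp chat messages and generate a SINGLE, "
        "short, descriptive title (max 5 words) that summarizes the conversation "
        "topic. Do not use quotes or prefixes like 'Topic:'. Just the title.\n\n"
        f"Messages:\n{context}"
    )

prompt = build_title_prompt([str(i) for i in range(20)])
print(prompt.splitlines()[-1])  # "9" -- everything past the tenth message is dropped
```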
preprocessor.py CHANGED
@@ -1,25 +1,73 @@
1
  import re
2
  import pandas as pd
3
- # from sentiment_train import predict_sentiment
4
- from sentiment import predict_sentiment_bert_batch
5
  import spacy
6
- from langdetect import detect, LangDetectException
7
- from sklearn.feature_extraction.text import CountVectorizer
8
  from sklearn.decomposition import LatentDirichletAllocation
9
  from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
10
  from spacy.lang.fr.stop_words import STOP_WORDS as FRENCH_STOP_WORDS
11
- from sklearn.feature_extraction.text import TfidfVectorizer
12
  from sklearn.cluster import KMeans
13
  from sklearn.manifold import TSNE
14
  import numpy as np
 
 
 
 
15
 
16
- # Load language models
17
- nlp_fr = spacy.load("fr_core_news_sm")
18
- nlp_en = spacy.load("en_core_web_sm")
19
 
20
- # Merge English and French stop words
 
 
21
  custom_stop_words = list(ENGLISH_STOP_WORDS.union(FRENCH_STOP_WORDS))
22
 
 
23
  def lemmatize_text(text, lang):
24
  if lang == 'fr':
25
  doc = nlp_fr(text)
@@ -27,166 +75,94 @@ def lemmatize_text(text, lang):
27
  doc = nlp_en(text)
28
  return " ".join([token.lemma_ for token in doc if not token.is_punct])
29
 
30
- def clean_message(text):
31
- """ Remove media notifications, special characters, and unwanted symbols. """
32
- if not isinstance(text, str):
33
- return ""
34
- text = text.lower() # Convert to lowercase
35
- text = re.sub(r"<media omitted>", "", text) # Remove media notifications
36
- text = re.sub(r"this message was deleted", "", text)
37
- text = re.sub(r"null", "", text)
38
-
39
- text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE) # Remove links
40
- text = re.sub(r"[^a-zA-ZΓ€-ΓΏ0-9\s]", "", text) # Remove special characters
41
- return text
42
-
43
- from sklearn.feature_extraction.text import TfidfVectorizer
44
- from sklearn.cluster import KMeans
45
- from sklearn.manifold import TSNE
46
- import numpy as np
47
-
48
- def preprocess_for_clustering(df, n_clusters=5):
49
- """
50
- Preprocess messages for clustering.
51
- Args:
52
- df (pd.DataFrame): DataFrame containing the 'lemmatized_message' column.
53
- n_clusters (int): Number of clusters to create.
54
- Returns:
55
- df (pd.DataFrame): DataFrame with added 'cluster' column.
56
- cluster_centers (np.array): Cluster centroids.
57
- """
58
- # Step 1: Vectorize text using TF-IDF
59
- vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
60
- tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
61
-
62
- # Step 2: Apply K-Means clustering
63
- kmeans = KMeans(n_clusters=n_clusters, random_state=42)
64
- clusters = kmeans.fit_predict(tfidf_matrix)
65
-
66
- # Step 3: Add cluster labels to DataFrame
67
- df['cluster'] = clusters
68
-
69
- # Step 4: Reduce dimensionality for visualization
70
- tsne = TSNE(n_components=2, random_state=42)
71
- reduced_features = tsne.fit_transform(tfidf_matrix.toarray())
72
-
73
- return df, reduced_features, kmeans.cluster_centers_
74
-
75
- def parse_data(data):
76
- """
77
- Parses the raw chat data into a DataFrame and performs basic cleaning.
78
- """
79
- # Optimization: Use pandas vectorized string operations instead of looping
80
-
81
- # Split lines
82
- lines = data.strip().split("\n")
83
- df = pd.DataFrame({'line': lines})
84
-
85
- # Extract Date, Time, Sender, Message using regex
86
  pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"
87
-
88
- extracted = df['line'].str.extract(pattern)
89
-
90
- # Drop lines that didn't match (if any)
91
- extracted = extracted.dropna(subset=['Date', 'Time', 'Message'])
92
-
93
- # Combine Date and Time
94
- extracted['Time'] = extracted['Time'].str.replace('\u202f', ' ', regex=False)
95
- extracted['message_date'] = extracted['Date'] + ", " + extracted['Time']
96
-
97
- # Handle Sender
98
- extracted['Sender'] = extracted['Sender'].fillna('group_notification')
99
-
100
- # Rename columns
101
- df = extracted.rename(columns={'Sender': 'user', 'Message': 'message'})
102
-
103
- # Filter out system messages
104
- df = df[df['user'].str.lower() != 'system']
105
-
106
- # Convert date
107
- df['date'] = pd.to_datetime(df['message_date'], format='%m/%d/%y, %I:%M %p', errors='coerce')
108
-
109
- # Filter out invalid dates
110
- df = df.dropna(subset=['date'])
111
-
112
- # Filter out group notifications
113
- df = df[df["user"] != "group_notification"]
114
- df.reset_index(drop=True, inplace=True)
115
 
116
- # unfiltered messages
 
 
117
  df["unfiltered_messages"] = df["message"]
118
- # Clean messages
119
  df["message"] = df["message"].apply(clean_message)
120
 
121
  # Extract time-based features
122
- df['year'] = df['date'].dt.year
123
  df['month'] = df['date'].dt.month_name()
124
- df['day'] = df['date'].dt.day
125
- df['hour'] = df['date'].dt.hour
126
  df['day_of_week'] = df['date'].dt.day_name()
127
- df['minute'] = df['date'].dt.minute
128
-
129
- period = []
130
- for hour in df['hour']:
131
- if hour == 23:
132
- period.append(str(hour) + "-" + str('00'))
133
- elif hour == 0:
134
- period.append(str('00') + "-" + str(hour + 1))
135
- else:
136
- period.append(str(hour) + "-" + str(hour + 1))
137
-
138
- df['period'] = period
139
-
140
- return df
141
-
142
- def analyze_sentiment_and_topics(df):
143
- """
144
- Performs heavy NLP tasks: Lemmatization, Sentiment Analysis, and Topic Modeling.
145
- Includes sampling for large datasets.
146
- """
147
- # Sampling Logic: Cap at 5000 messages for deep analysis
148
- original_df_len = len(df)
149
- if len(df) > 5000:
150
- print(f"Sampling 5000 messages from {len(df)}...")
151
- # We keep the original index to potentially map back, but for now we just work on the sample
152
- df_sample = df.sample(5000, random_state=42).copy()
153
- else:
154
- df_sample = df.copy()
155
-
156
- # Filter and lemmatize messages
157
- lemmatized_messages = []
158
- # Optimization: Detect dominant language on a sample
159
- sample_size = min(len(df_sample), 500)
160
- sample_text = " ".join(df_sample["message"].sample(sample_size, random_state=42).tolist())
161
- try:
162
- dominant_lang = detect(sample_text)
163
- except LangDetectException:
164
- dominant_lang = 'en'
165
-
166
- nlp = nlp_fr if dominant_lang == 'fr' else nlp_en
167
 
168
- # Use nlp.pipe for batch processing
169
  lemmatized_messages = []
170
- for doc in nlp.pipe(df_sample["message"].tolist(), batch_size=1000, disable=["ner", "parser"]):
171
- lemmatized_messages.append(" ".join([token.lemma_ for token in doc if not token.is_punct]))
172
-
173
- df_sample["lemmatized_message"] = lemmatized_messages
174
-
175
- # Apply sentiment analysis
176
- # Use batch processing for speed
177
- df_sample['sentiment'] = predict_sentiment_bert_batch(df_sample["message"].tolist(), batch_size=128)
178
-
179
- # Filter out rows with null lemmatized_message
180
- df_sample = df_sample.dropna(subset=['lemmatized_message'])
181
-
182
- # **Fix: Use a custom stop word list**
183
  vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words=custom_stop_words)
184
- try:
185
- dtm = vectorizer.fit_transform(df_sample['lemmatized_message'])
186
- except ValueError:
187
- # Handle case where vocabulary is empty (e.g. all stop words)
188
- print("Warning: Empty vocabulary after filtering. Returning empty topics.")
189
- return df_sample, []
190
 
191
  # Apply LDA
192
  lda = LatentDirichletAllocation(n_components=5, random_state=42)
@@ -194,17 +170,63 @@ def analyze_sentiment_and_topics(df):
194
 
195
  # Assign topics to messages
196
  topic_results = lda.transform(dtm)
197
- df_sample = df_sample.iloc[:topic_results.shape[0]].copy()
198
- df_sample['topic'] = topic_results.argmax(axis=1)
199
 
200
  # Store topics for visualization
201
  topics = []
202
  for topic in lda.components_:
203
  topics.append([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-10:]])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
204
 
205
- # If we sampled, we return the sampled dataframe with sentiment/topics.
206
- # The main app will need to handle that 'df' (full) and 'df_analyzed' (sample) might be different.
207
- # Or we can try to merge back? Merging back 5000 sentiments to 40000 messages leaves 35000 nulls.
208
- # For visualization purposes (pie charts, etc), using the sample is usually fine as it's representative.
209
 
210
- return df_sample, topics
1
  import re
2
  import pandas as pd
 
 
3
  import spacy
4
+ from langdetect import detect_langs
5
+ from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
6
  from sklearn.decomposition import LatentDirichletAllocation
7
  from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
8
  from spacy.lang.fr.stop_words import STOP_WORDS as FRENCH_STOP_WORDS
 
9
  from sklearn.cluster import KMeans
10
  from sklearn.manifold import TSNE
11
  import numpy as np
12
+ import torch
13
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
14
+ import streamlit as st
15
+ from datetime import datetime
16
 
 
 
 
17
 
18
+ # Lighter model
19
+ MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
20
+
21
+ # Cache model loading with fallback for quantization
22
+ @st.cache_resource
23
+ def load_model():
24
+ device = "cuda" if torch.cuda.is_available() else "cpu"
25
+ print(f"Using device: {device}")
26
+ tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
27
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL).to(device)
28
+
29
+ # Attempt quantization with fallback
30
+ try:
31
+ # Pick a quantization engine the current build actually supports (fbgemm on x86, qnnpack on ARM)
33
+ torch.backends.quantized.engine = 'fbgemm' if 'fbgemm' in torch.backends.quantized.supported_engines else 'qnnpack'
33
+ model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
34
+ print("Model quantized successfully.")
35
+ except RuntimeError as e:
36
+ print(f"Quantization failed: {e}. Using non-quantized model.")
37
+
38
+ config = AutoConfig.from_pretrained(MODEL)
39
+ return tokenizer, model, config, device
40
+
41
+ tokenizer, model, config, device = load_model()
42
+
43
+ nlp_fr = spacy.load("fr_core_news_sm")
44
+ nlp_en = spacy.load("en_core_web_sm")
45
  custom_stop_words = list(ENGLISH_STOP_WORDS.union(FRENCH_STOP_WORDS))
46
 
47
+ def preprocess_text(text):
48
+ if text is None:
49
+ return ""
50
+ if not isinstance(text, str):
51
+ try:
52
+ text = str(text)
53
+ except Exception:
54
+ return ""
55
+ new_text = []
56
+ for t in text.split(" "):
57
+ t = '@user' if t.startswith('@') and len(t) > 1 else t
58
+ t = 'http' if t.startswith('http') else t
59
+ new_text.append(t)
60
+ return " ".join(new_text)
61
+
62
+ def clean_message(text):
63
+ if not isinstance(text, str):
64
+ return ""
65
+ text = text.lower()
66
+ text = text.replace("<media omitted>", "").replace("this message was deleted", "").replace("null", "")
67
+ text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
68
+ text = re.sub(r"[^a-zA-ZΓ€-ΓΏ0-9\s]", "", text)
69
+ return text.strip()
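A quick standalone sanity check of the cleaning rules above (the function is reproduced here so the snippet runs on its own):

```python
import re

def clean_message(text):
    # Same steps as the diff above: lowercase, strip WhatsApp placeholders,
    # drop URLs, then keep only letters, accented letters, digits, and spaces.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = text.replace("<media omitted>", "").replace("this message was deleted", "").replace("null", "")
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s]", "", text)
    return text.strip()

print(clean_message("Check this: https://example.com !!"))  # -> check this
print(clean_message("<Media omitted>"))                     # -> empty string
```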
70
+
71
  def lemmatize_text(text, lang):
72
  if lang == 'fr':
73
  doc = nlp_fr(text)
 
75
  doc = nlp_en(text)
76
  return " ".join([token.lemma_ for token in doc if not token.is_punct])
77
 
78
+ def preprocess(data):
79
  pattern = r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$"
80
+ filtered_messages, valid_dates = [], []
81
+
82
+ for line in data.strip().split("\n"):
83
+ match = re.match(pattern, line)
84
+ if match:
85
+ entry = match.groupdict()
86
+ sender = entry.get("Sender")
87
+ if sender and sender.strip().lower() != "system":
88
+ filtered_messages.append(f"{sender.strip()}: {entry['Message']}")
89
+ valid_dates.append(f"{entry['Date']}, {entry['Time'].replace('\u202f', ' ')}")
90
91
+ def convert_to_target_format(date_str):
92
+ try:
93
+ # Attempt to parse the original date string
94
+ dt = datetime.strptime(date_str, '%d/%m/%Y, %H:%M')
95
+ except ValueError:
96
+ # Return the original date string if parsing fails
97
+ return date_str
98
+
99
+ # Extract components without leading zeros
100
+ month = dt.month
101
+ day = dt.day
102
+ year_short = dt.strftime('%y') # Last two digits of the year
103
+
104
+ # Convert to 12-hour format and determine AM/PM
105
+ hour_12 = dt.hour % 12
106
+ if hour_12 == 0:
107
+ hour_12 = 12 # Adjust 0 (from 12 AM/PM) to 12
108
+ hour_str = str(hour_12)
109
+
110
+ # Format minute with leading zero if necessary
111
+ minute_str = f"{dt.minute:02d}"
112
+
113
+ # Get AM/PM designation
114
+ am_pm = dt.strftime('%p')
115
+
116
+ # Construct the formatted date string with Unicode narrow space
117
+ return f"{month}/{day}/{year_short}, {hour_str}:{minute_str}\u202f{am_pm}"
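The 12-hour conversion above needs the explicit adjustment because `hour % 12` maps both 0 (midnight) and 12 (noon) to 0; a minimal check of that rule:

```python
def to_12h(hour):
    # hour % 12 sends both 0 and 12 to 0, so those two cases
    # must be forced back to 12, as in the function above.
    h = hour % 12
    return 12 if h == 0 else h

print([to_12h(h) for h in (0, 1, 11, 12, 13, 23)])  # [12, 1, 11, 12, 1, 11]
```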
118
+
119
+ converted_dates = [convert_to_target_format(date) for date in valid_dates]
120
+
121
+
122
+ df = pd.DataFrame({'user_message': filtered_messages, 'message_date': converted_dates})
123
+ df['message_date'] = pd.to_datetime(df['message_date'], format='%m/%d/%y, %I:%M %p', errors='coerce')
124
+ df.rename(columns={'message_date': 'date'}, inplace=True)
125
+
126
+ users, messages = [], []
127
+ msg_pattern = r"^(.*?):\s(.*)$"
128
+ for message in df["user_message"]:
129
+ match = re.match(msg_pattern, message)
130
+ if match:
131
+ users.append(match.group(1))
132
+ messages.append(match.group(2))
133
+ else:
134
+ users.append("group_notification")
135
+ messages.append(message)
136
 
137
+ df["user"] = users
138
+ df["message"] = messages
139
+ df = df[df["user"] != "group_notification"].reset_index(drop=True)
140
  df["unfiltered_messages"] = df["message"]
 
141
  df["message"] = df["message"].apply(clean_message)
142
 
143
  # Extract time-based features
144
+ df['year'] = pd.to_numeric(df['date'].dt.year, downcast='integer')
145
  df['month'] = df['date'].dt.month_name()
146
+ df['day'] = pd.to_numeric(df['date'].dt.day, downcast='integer')
147
+ df['hour'] = pd.to_numeric(df['date'].dt.hour, downcast='integer')
148
  df['day_of_week'] = df['date'].dt.day_name()
149
 
150
+ # Lemmatize messages for topic modeling
151
  lemmatized_messages = []
152
+ for message in df["message"]:
153
+ try:
154
+ lang = detect_langs(message)[0].lang
155
+ lemmatized_messages.append(lemmatize_text(message, lang))
156
+ except Exception:
157
+ lemmatized_messages.append("")
158
+ df["lemmatized_message"] = lemmatized_messages
159
+
160
+ df = df[df["message"].notnull() & (df["message"] != "")].copy()
161
+ df.drop(columns=["user_message"], inplace=True)
162
+
163
+ # Perform topic modeling
 
164
  vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words=custom_stop_words)
165
+ dtm = vectorizer.fit_transform(df['lemmatized_message'])
 
 
 
 
 
166
 
167
  # Apply LDA
168
  lda = LatentDirichletAllocation(n_components=5, random_state=42)
 
170
 
171
  # Assign topics to messages
172
  topic_results = lda.transform(dtm)
173
+ df = df.iloc[:topic_results.shape[0]].copy()
174
+ df['topic'] = topic_results.argmax(axis=1)
175
 
176
  # Store topics for visualization
177
  topics = []
178
  for topic in lda.components_:
179
  topics.append([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-10:]])
180
+ print("Top words for each topic:")
181
+ print(topics)
182
+
183
+ return df, topics
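The line pattern used in `preprocess` can be exercised on its own; both the 12-hour export style (with the U+202F narrow no-break space before AM/PM) and the 24-hour style match:

```python
import re

# Same pattern as above, split for readability.
pattern = (r"^(?P<Date>\d{1,2}/\d{1,2}/\d{2,4}),\s+"
           r"(?P<Time>[\d:]+(?:\S*\s?[AP]M)?)\s+-\s+"
           r"(?:(?P<Sender>.*?):\s+)?(?P<Message>.*)$")

m1 = re.match(pattern, "3/14/23, 9:05\u202fPM - Alice: See you tomorrow")
m2 = re.match(pattern, "14/03/2023, 21:05 - Bob: D'accord")

print(m1.group("Sender"), m1.group("Message"))  # sender and body of the 12-hour line
print(m2.group("Date"), m2.group("Time"))       # date and time of the 24-hour line
```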
184
+
185
+ def preprocess_for_clustering(df, n_clusters=5):
186
+ df = df[df["lemmatized_message"].notnull() & (df["lemmatized_message"].str.strip() != "")]
187
+ df = df.reset_index(drop=True)
188
+
189
+ vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
190
+ tfidf_matrix = vectorizer.fit_transform(df['lemmatized_message'])
191
+
192
+ if tfidf_matrix.shape[0] < 2:
193
+ raise ValueError("Not enough messages for clustering.")
194
+
195
+ df = df.iloc[:tfidf_matrix.shape[0]].copy()
196
+
197
+ kmeans = KMeans(n_clusters=n_clusters, random_state=42)
198
+ clusters = kmeans.fit_predict(tfidf_matrix)
199
 
200
+ df['cluster'] = clusters
201
+ tsne = TSNE(n_components=2, random_state=42)
202
+ reduced_features = tsne.fit_transform(tfidf_matrix.toarray())
 
203
 
204
+ return df, reduced_features, kmeans.cluster_centers_
205
+
206
+
207
+ def predict_sentiment_batch(texts: list, batch_size: int = 32) -> list:
208
+ """Predict sentiment for a batch of texts"""
209
+ if not isinstance(texts, list):
210
+ raise TypeError(f"Expected list of texts, got {type(texts)}")
211
+
212
+ processed_texts = [preprocess_text(text) for text in texts]
213
+
214
+ predictions = []
215
+ for i in range(0, len(processed_texts), batch_size):
216
+ batch = processed_texts[i:i+batch_size]
217
+
218
+ inputs = tokenizer(
219
+ batch,
220
+ padding=True,
221
+ truncation=True,
222
+ return_tensors="pt",
223
+ max_length=128
224
+ ).to(device)
225
+
226
+ with torch.no_grad():
227
+ outputs = model(**inputs)
228
+
229
+ batch_preds = outputs.logits.argmax(dim=1).cpu().numpy()
230
+ predictions.extend([config.id2label[p] for p in batch_preds])
231
+
232
+ return predictions
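The batching loop in `predict_sentiment_batch` is plain fixed-size slicing; a model-free sketch (the helper name is mine, not from the codebase):

```python
def batches(items, batch_size=32):
    # Yield consecutive slices of at most batch_size elements,
    # mirroring the range(0, len(...), batch_size) loop above.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

sizes = [len(b) for b in batches(list(range(70)), 32)]
print(sizes)  # [32, 32, 6]
```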
profile_performance.py DELETED
@@ -1,70 +0,0 @@
1
- import time
2
- import pandas as pd
3
- import preprocessor
4
- import random
5
-
6
- def generate_large_chat(lines=10000):
7
- """Generates a synthetic WhatsApp chat log."""
8
- senders = ["User1", "User2", "User3"]
9
- messages = [
10
- "Hello there, how are you?",
11
- "I am doing great, thanks for asking! Project update?",
12
- "This is a test message to simulate a long chat about artificial intelligence.",
13
- "Meeting is at 10 AM tomorrow to discuss the roadmap.",
14
- "Check out this link: https://example.com",
15
- "Haha that is funny πŸ˜‚",
16
- "Je parle un peu franΓ§ais aussi. C'est la vie.",
17
- "Non, je ne crois pas. Il fait beau aujourd'hui.",
18
- "Ok, see you later. Don't forget the deadline.",
19
- "Python is a great programming language for data science.",
20
- "Streamlit makes building apps very easy and fast."
21
- ]
22
-
23
- chat_data = []
24
- for _ in range(lines):
25
- date = f"{random.randint(1, 12)}/{random.randint(1, 28)}/23"
26
- hour = random.randint(1, 12)
27
- minute = random.randint(10, 59)
28
- ampm = random.choice(["AM", "PM"])
29
- time_str = f"{hour}:{minute} {ampm}"
30
- sender = random.choice(senders)
31
- message = random.choice(messages)
32
- chat_data.append(f"{date}, {time_str} - {sender}: {message}")
33
-
34
- return "\n".join(chat_data)
35
-
36
- def profile_preprocessing():
37
- print("Generating synthetic data (10,000 lines)...")
38
- raw_data = generate_large_chat(10000)
39
- print(f"Data size: {len(raw_data) / 1024 / 1024:.2f} MB")
40
-
41
- print("\nStarting profiling...")
42
- start_total = time.time()
43
-
44
- # We can't easily profile inside the function without modifying it,
45
- # so we will measure the total time and infer from code analysis
46
- # or modify preprocessor.py temporarily to print timings.
47
- # For now, let's just run it and see the total time.
48
-
49
- try:
50
- start_time = time.time()
51
-
52
- # Step 1: Parse
53
- df = preprocessor.parse_data(raw_data)
54
- print(f"Parsing took: {time.time() - start_time:.2f}s")
55
-
56
- # Step 2: Analyze
57
- step_start = time.time()
58
- df, topics = preprocessor.analyze_sentiment_and_topics(df)
59
- print(f"Analysis took: {time.time() - step_start:.2f}s")
60
- end_total = time.time()
61
- print(f"\nTotal Preprocessing Time: {end_total - start_total:.2f} seconds")
62
- print(f"Messages processed: {len(df)}")
63
-
64
- except Exception as e:
65
- print(f"Error: {e}")
66
- import traceback
67
- traceback.print_exc()
68
-
69
- if __name__ == "__main__":
70
- profile_preprocessing()
reproduce_issue.py DELETED
@@ -1,27 +0,0 @@
1
- import pandas as pd
2
- from collections import Counter
3
- import emoji
4
-
5
- def emoji_helper_simulated(emojis_list):
6
- emoji_df = pd.DataFrame(Counter(emojis_list).most_common(len(Counter(emojis_list))))
7
- return emoji_df
8
-
9
- # Case 1: Emojis present
10
- print("Case 1: Emojis present")
11
- df1 = emoji_helper_simulated(['πŸ˜€', 'πŸ˜€', 'πŸ˜‚'])
12
- print(df1)
13
- try:
14
- print(df1[1].head())
15
- print("Access successful")
16
- except KeyError as e:
17
- print(f"KeyError: {e}")
18
-
19
- # Case 2: No emojis
20
- print("\nCase 2: No emojis")
21
- df2 = emoji_helper_simulated([])
22
- print(df2)
23
- try:
24
- print(df2[1].head())
25
- print("Access successful")
26
- except KeyError as e:
27
- print(f"KeyError: {e}")
requirements.txt CHANGED
@@ -1,6 +1,6 @@
1
  streamlit
2
- matplotlib==3.7.1
3
  preprocessor
 
4
  seaborn
5
  urlextract
6
  wordcloud
@@ -18,7 +18,6 @@ plotly
18
  nltk
19
  spacy==3.7.0
20
  thinc>=8.1.8,<8.3.0
21
- python-dotenv
22
  deep_translator
23
  https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl
24
  https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl
 
1
  streamlit
 
2
  preprocessor
3
+ matplotlib
4
  seaborn
5
  urlextract
6
  wordcloud
 
18
  nltk
19
  spacy==3.7.0
20
  thinc>=8.1.8,<8.3.0
 
21
  deep_translator
22
  https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl
23
  https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl
sentiment.py CHANGED
@@ -1,27 +1,27 @@
1
  import pandas as pd
 
2
  import torch
3
- from sklearn.metrics import accuracy_score, classification_report
4
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
5
 
6
- MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
 
7
 
8
- # Check if GPU is available
9
- # Check if GPU is available (CUDA or MPS)
10
- if torch.cuda.is_available():
11
- device = torch.device("cuda")
12
- elif torch.backends.mps.is_available():
13
- device = torch.device("mps")
14
- else:
15
- device = torch.device("cpu")
16
  print(f"Using device: {device}")
17
 
18
- # Load the model and tokenizer
19
- model = AutoModelForSequenceClassification.from_pretrained(MODEL)
20
- model.to(device) # Move model to GPU
21
- tokenizer = AutoTokenizer.from_pretrained(MODEL)
22
  config = AutoConfig.from_pretrained(MODEL)
23
 
24
- # Preprocess text (username and link placeholders)
 
 
 
25
  def preprocess(text):
26
  if not isinstance(text, str):
27
  text = str(text) if not pd.isna(text) else ""
@@ -31,58 +31,68 @@ def preprocess(text):
31
  t = '@user' if t.startswith('@') and len(t) > 1 else t
32
  t = 'http' if t.startswith('http') else t
33
  new_text.append(t)
34
-
35
  return " ".join(new_text)
36
 
37
- # Sentiment prediction using GPU (Batch Processing)
38
- def predict_sentiment_bert_batch(texts: list, batch_size: int = 32) -> list:
39
- all_sentiments = []
 
40
 
41
- # Preprocess all texts
42
- processed_texts = [preprocess(text) for text in texts]
 
 
43
 
44
- # Process in batches
 
 
 
45
  for i in range(0, len(processed_texts), batch_size):
46
- batch_texts = processed_texts[i:i + batch_size]
47
-
48
- encoded_input = tokenizer(
49
- batch_texts,
50
- return_tensors='pt',
51
- truncation=True,
52
- padding=True,
53
- max_length=128
54
- )
55
-
56
- encoded_input.pop("token_type_ids", None) # XLM-Roberta doesn't use these
57
-
58
- # Move input tensors to the same device as the model
59
- encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
60
-
61
- model.eval()
62
- with torch.no_grad():
63
- output = model(**encoded_input)
64
 
65
- # Get predictions for the batch
66
- batch_indices = output.logits.argmax(dim=1).tolist()
67
- batch_sentiments = [config.id2label[idx] for idx in batch_indices]
68
- all_sentiments.extend(batch_sentiments)
 
69
 
70
- return all_sentiments
 
 
 
 
 
71
 
72
- # Keep single prediction for backward compatibility if needed, but it calls batch with size 1
73
- def predict_sentiment_bert(text: str) -> str:
74
- return predict_sentiment_bert_batch([text], batch_size=1)[0]
 
75
 
76
- # Example usage (optional)
77
- # print(predict_sentiment_bert("This is amazing!"))
78
 
79
- # Predict on full dataset (uncomment when ready)
80
- # test_data['predicted_sentiment'] = test_data['text'].apply(predict_sentiment_bert)
81
 
82
- # Calculate accuracy (uncomment when ready)
83
- # accuracy = accuracy_score(test_labels['label'], test_data['predicted_sentiment'])
84
- # print(f"Accuracy: {accuracy:.2f}")
 
 
 
 
85
 
86
- # Generate classification report (uncomment when ready)
87
- # report = classification_report(test_labels['label'], test_data['predicted_sentiment'])
88
- # print("Classification Report:\n", report)
 
1
  import pandas as pd
2
+ import time
3
  import torch
 
4
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
5
 
6
+ # Use a sentiment-specific model (replace with TinyBERT if fine-tuned)
7
+ MODEL = "tabularisai/multilingual-sentiment-analysis" # Multilingual model; labels include "Very Positive"/"Very Negative", collapsed below
8
 
9
+ print("Loading model and tokenizer...")
10
+ start_load = time.time()
11
+
12
+ # Check for MPS (Metal) availability on M2 chip, fallback to CPU
13
+ device = "mps" if torch.backends.mps.is_available() else "cpu"
 
 
 
14
  print(f"Using device: {device}")
15
 
16
+ # Load with optimizations (only once, removing redundancy)
17
+ tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
18
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL).to(device)
 
19
  config = AutoConfig.from_pretrained(MODEL)
20
 
21
+ load_time = time.time() - start_load
22
+ print(f"Model and tokenizer loaded in {load_time:.2f} seconds\n")
23
+
24
+ # Optimized preprocessing (unchanged from your code)
25
  def preprocess(text):
26
  if not isinstance(text, str):
27
  text = str(text) if not pd.isna(text) else ""
 
31
  t = '@user' if t.startswith('@') and len(t) > 1 else t
32
  t = 'http' if t.startswith('http') else t
33
  new_text.append(t)
 
34
  return " ".join(new_text)
35
 
36
+ # Batch prediction function (optimized for performance)
37
+ def predict_sentiment_batch(texts: list, batch_size: int = 16) -> list:
38
+ if not isinstance(texts, list):
39
+ raise TypeError(f"Expected list of texts, got {type(texts)}")
40
 
41
+ # Validate and clean inputs
42
+ valid_texts = [str(text) for text in texts if isinstance(text, str) and text.strip()]
43
+ if not valid_texts:
44
+ return [] # Return empty list if no valid texts
45
 
46
+ print(f"Processing {len(valid_texts)} valid samples...")
47
+ processed_texts = [preprocess(text) for text in valid_texts]
48
+
49
+ predictions = []
50
  for i in range(0, len(processed_texts), batch_size):
51
+ batch = processed_texts[i:i + batch_size]
52
+ try:
53
+ inputs = tokenizer(
54
+ batch,
55
+ padding=True,
56
+ truncation=True,
57
+ return_tensors="pt",
58
+ max_length=64 # Reduced for speed on short texts like tweets
59
+ ).to(device)
60
+
61
+ with torch.no_grad():
62
+ outputs = model(**inputs)
 
 
 
 
 
 
63
 
64
+ batch_preds = outputs.logits.argmax(dim=1).cpu().numpy()
65
+ predictions.extend([config.id2label[p] for p in batch_preds])
66
+ except Exception as e:
67
+ print(f"Error processing batch {i // batch_size}: {str(e)}")
68
+ predictions.extend(["neutral"] * len(batch)) # Consider logging instead
69
 
70
+ print(f"Generated predictions for {len(valid_texts)} samples")
71
+ predictions = [prediction.lower().replace("very ", "") for prediction in predictions]
72
+
73
74
+
75
+ return predictions
76
 
77
+ # # Example usage with your dataset (uncomment and adjust paths)
78
+ # test_data = pd.read_csv("/Users/caasidev/development/AI/last try/Whatssap-project/srcs/tweets.csv")
79
+ # print(f"Processing {len(test_data)} samples...")
80
+ # start_prediction = time.time()
81
 
82
+ # text_samples = test_data['text'].tolist()
83
+ # test_data['predicted_sentiment'] = predict_sentiment_batch(text_samples)
84
 
85
+ # prediction_time = time.time() - start_prediction
86
+ # time_per_sample = prediction_time / len(test_data)
87
 
88
+ # # Print runtime statistics
89
+ # print("\nRuntime Statistics:")
90
+ # print(f"- Model loading time: {load_time:.2f} seconds")
91
+ # print(f"- Total prediction time for {len(test_data)} samples: {prediction_time:.2f} seconds")
92
+ # print(f"- Average time per sample: {time_per_sample:.4f} seconds")
93
+ # print(f"- Estimated time for 1000 samples: {(time_per_sample * 1000):.2f} seconds")
94
+ # print(f"- Estimated time for 20000 samples: {(time_per_sample * 20000 / 60):.2f} minutes")
95
 
96
+ # # Print a sample of predictions
97
+ # print("\nPredicted Sentiments (first 5 samples):")
98
+ # print(test_data[['text', 'predicted_sentiment']].head())
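The `.lower().replace("very ", "")` step above folds the model's five-level labels (assuming a Very Negative through Very Positive scheme) into three classes:

```python
def normalize_label(label):
    # "Very Positive" -> "positive", "Very Negative" -> "negative",
    # everything else is simply lowercased.
    return label.lower().replace("very ", "")

raw = ["Very Positive", "Positive", "Neutral", "Negative", "Very Negative"]
print([normalize_label(l) for l in raw])
# ['positive', 'positive', 'neutral', 'negative', 'negative']
```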
sentiment_train.py DELETED
@@ -1,41 +0,0 @@
1
- import os
2
- import joblib
3
- import string
4
- import re
5
- import nltk
6
- from nltk.corpus import stopwords
7
-
8
- # Get the directory of the current script
9
- script_dir = os.path.dirname(__file__)
10
-
11
- # Construct paths to the model and vectorizer files
12
- model_path = os.path.join(script_dir, "naive_bayes_model.pkl")
13
- vectorizer_path = os.path.join(script_dir, "tfidf_vectorizer.pkl")
14
-
15
- # Load saved model and vectorizer
16
- try:
17
- model = joblib.load(model_path)
18
- vectorizer = joblib.load(vectorizer_path)
19
- except FileNotFoundError as e:
20
- print(f"Error: {e}")
21
- raise
22
-
23
- # Load stopwords
24
- nltk.download("stopwords")
25
- stop_words = set(stopwords.words("english") + stopwords.words("french"))
26
-
27
- # Function to clean text (must match preprocessing in training script)
28
- def clean_text(text):
29
- if isinstance(text, float):
30
- return ""
31
- text = text.lower()
32
- text = re.sub(f"[{string.punctuation}]", "", text)
33
- text = " ".join([word for word in text.split() if word not in stop_words])
34
- return text
35
-
36
- # Function to predict sentiment
37
- def predict_sentiment(text):
38
- cleaned_text = clean_text(text)
39
- vectorized_text = vectorizer.transform([cleaned_text])
40
- prediction = model.predict(vectorized_text)
41
- return prediction[0]
test.py DELETED
@@ -1,67 +0,0 @@
1
- import nltk
2
- import string
3
- import re
4
- import pandas as pd
5
- import numpy as np
6
- import joblib
7
- from nltk.corpus import stopwords
8
- from sklearn.feature_extraction.text import TfidfVectorizer
9
- from sklearn.model_selection import train_test_split
10
- from sklearn.naive_bayes import MultinomialNB
11
- from sklearn.metrics import accuracy_score, classification_report
12
- from googletrans import Translator
13
- from imblearn.over_sampling import SMOTE
14
-
15
- nltk.download('stopwords')
16
- nltk.download('punkt')
17
-
18
- translator = Translator()
19
-
20
- # Load dataset
21
- data = pd.read_csv('/Users/caasidev/development/AI/datasets/train.csv', encoding='ISO-8859-1')
22
-
23
- # Drop missing values
24
- data = data.dropna(subset=['text', 'sentiment'])
25
-
26
- stop_words = set(stopwords.words('english') + stopwords.words('french'))
27
-
28
- # Function to clean text
29
- def clean_text(text):
30
- if isinstance(text, float):
31
- return ""
32
- text = text.lower()
33
- text = re.sub(f"[{string.punctuation}]", "", text)
34
- text = " ".join([word for word in text.split() if word not in stop_words])
35
- return text
36
-
37
- # Apply text cleaning
38
- data['Cleaned_Text'] = data['text'].apply(clean_text)
39
-
40
- # **Vectorization BEFORE SMOTE**
41
- vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.85, min_df=2, max_features=10000)
42
- X_tfidf = vectorizer.fit_transform(data['Cleaned_Text'])
43
- y = data['sentiment']
44
-
45
- # Apply SMOTE **after** vectorization
46
- smote = SMOTE(random_state=42)
47
- X_resampled, y_resampled = smote.fit_resample(X_tfidf, y)
48
-
49
- # Train-test split
50
- X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
51
-
52
- # Train Naive Bayes
53
- model = MultinomialNB(alpha=0.5)
54
- model.fit(X_train, y_train)
55
-
56
- # Save model and vectorizer
57
- joblib.dump(model, "naive_bayes_model.pkl")
58
- joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
59
- print("Model and vectorizer saved successfully!")
60
-
61
- # Predictions
62
- y_pred = model.predict(X_test)
63
-
64
- # Evaluation
65
- accuracy = accuracy_score(y_test, y_pred)
66
- print(f"Improved Accuracy: {accuracy * 100:.2f}%")
67
- print("\nClassification Report:\n", classification_report(y_test, y_pred))
tfidf_vectorizer.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:a6f08647a3d11077c7c5922973244b227a8d2b2e7ac46707d9f258c8a92de1b5
3
- size 375493
 
 
 
 
verify_fix.py DELETED
@@ -1,48 +0,0 @@
- import pandas as pd
- import helper
-
- # Create a dummy dataframe with no emojis
- data = {
-     'user': ['User1', 'User2'],
-     'message': ['Hello world', 'This is a test'],
-     'unfiltered_messages': ['Hello world', 'This is a test']
- }
- df = pd.DataFrame(data)
-
- print("Testing emoji_helper with no emojis...")
- try:
-     emoji_df = helper.emoji_helper('Overall', df)
-     print("emoji_df columns:", emoji_df.columns.tolist())
-     print("emoji_df shape:", emoji_df.shape)
-
-     if 0 in emoji_df.columns and 1 in emoji_df.columns:
-         print("SUCCESS: Columns 0 and 1 exist.")
-     else:
-         print("FAILURE: Columns 0 and 1 missing.")
-
- except Exception as e:
-     print(f"FAILURE: Exception occurred: {e}")
-
- # Test with emojis to ensure no regression
- data_with_emoji = {
-     'user': ['User1'],
-     'message': ['Hello 😀'],
-     'unfiltered_messages': ['Hello 😀']
- }
- df_emoji = pd.DataFrame(data_with_emoji)
-
- print("\nTesting emoji_helper with emojis...")
- try:
-     emoji_df = helper.emoji_helper('Overall', df_emoji)
-     print("emoji_df columns:", emoji_df.columns.tolist())
-     print("emoji_df shape:", emoji_df.shape)
-
-     if 0 in emoji_df.columns and 1 in emoji_df.columns:
-         print("SUCCESS: Columns 0 and 1 exist.")
-     else:
-         print("FAILURE: Columns 0 and 1 missing.")
-
- except Exception as e:
-     print(f"FAILURE: Exception occurred: {e}")
verify_refactor.py DELETED
@@ -1,41 +0,0 @@
- import helper
- from openrouter_chat import get_chat_completion
-
- def test_openrouter_connection():
-     print("Testing OpenRouter connection...")
-     try:
-         response = get_chat_completion([{"role": "user", "content": "Hello"}])
-         if response:
-             print("✅ OpenRouter connection successful.")
-         else:
-             print("❌ OpenRouter connection failed (empty response).")
-     except Exception as e:
-         print(f"❌ OpenRouter connection failed: {e}")
-
- def test_title_generation():
-     print("\nTesting Title Generation from Messages...")
-     messages = [
-         "Hey, are we still on for the movies tonight?",
-         "Yeah, lets go watch that new sci-fi one.",
-         "Cool, I'll buy the tickets online.",
-         "Meet you at the cinema at 7?"
-     ]
-
-     topic_map = {0: messages}
-
-     try:
-         titles = helper.generate_topic_titles_from_messages(topic_map)
-         title = titles.get(0)
-         print(f"Generated Title: {title}")
-
-         if title and title != "Topic 0":
-             print("✅ Title generation successful.")
-         else:
-             print("⚠️ Title generation returned default or empty.")
-
-     except Exception as e:
-         print(f"❌ Title generation failed: {e}")
-
- if __name__ == "__main__":
-     test_openrouter_connection()
-     test_title_generation()