DanielKiani committed on
Commit
64fbadd
·
1 Parent(s): 43c6927

Feat: Introduce ReviewSense v2.0 with RAG Chatbot and Mistral LLM

README.md CHANGED
@@ -1,188 +1,295 @@
1
  ![Banner](assets/banner.png)
2
- [![Python](https://img.shields.io/badge/Python-3.12.11-blue?logo=python)](https://www.python.org/)[![PyTorch](https://img.shields.io/badge/PyTorch-2.8-EE4C2C?logo=pytorch)](https://pytorch.org/)![Made with ML](https://img.shields.io/badge/Made%20with-ML-blueviolet?logo=openai)[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
3
 
4
- # 🛍️ ReviewSense: Product Review Analysis Engine
5
 
6
- > *ReviewSense is a comprehensive, end-to-end Natural Language Processing application built to extract deep, actionable insights from unstructured product reviews.*
7
- Where a simple star rating only tells part of the story, ReviewSense dives into the text to uncover what customers are saying, why they're saying it, and how they feel about specific product features.
8
 
9
- This project demonstrates a complete **MLOps workflow**, from initial data preparation and model fine-tuning to the development and deployment of a multi-model, interactive web application with Gradio.
10
 
11
- ![Gradio app](assets/gradio.png)
12
- You can find the Gradio app [here](https://huggingface.co/spaces/Deathshot78/ReviewSense)
 
 
 
13
 
14
  ---
15
 
16
  ## 📋 Table of Contents
17
 
18
- - [📖 Overview](#-overview)
19
- - [Key Features](#key-features)
20
- - [📈 Model Development and Benchmarking](#-model-development-and-benchmarking)
21
- - [🧠 How It Works: The Analysis Pipeline](#-how-it-works-the-analysis-pipeline)
 
 
22
  - [🔮 Future Improvements](#-future-improvements)
23
- - [⚙️ Setup and Installation](#️-setup-and-installation)
24
- - [▶️ Usage](#️-usage)
25
- - [📁 Project Structure](#-project-structure)
26
- - [🛠️ Technologies and Models](#️-technologies-and-models)
 
27
 
28
  ---
29
 
30
  ## 📖 Overview
31
 
32
- In the world of e-commerce, customer reviews are a **goldmine of information**. However, manually reading through thousands of reviews is impossible.
33
-
34
- This project solves that problem by creating an automated system that performs a **multi-layered analysis** on any given product review, providing a structured output that is far more valuable than a simple positive/negative label.
35
 
36
- The application is built to be **modular** and showcases the power of combining a custom fine-tuned model with several pre-trained transformers, each specialized for a specific task.
37
 
38
  ---
39
 
40
- ## Key Features
41
 
42
- - 📈 **Overall Sentiment Analysis**
43
- Classifies reviews as **POSITIVE** or **NEGATIVE**, powered by a fine-tuned DistilBERT model on the Amazon Reviews dataset.
44
 
45
- - 🔎 **Dynamic Aspect Extraction**
46
- Automatically identifies product features (e.g., *camera, battery life, plot*) using POS tagging combined with category-specific dictionaries.
 
47
 
48
- - 📊 **Aspect-Based Sentiment Analysis (ABSA)**
49
- Determines specific sentiment (*Positive, Negative, Neutral*) for each aspect, e.g., *“loved the camera, disappointed with the battery life.”*
50
 
51
- - 📝 **Abstractive Summarization**
52
- Generates concise summaries of reviews using a pre-trained DistilBART model.
53
 
54
- - 🚀 **Interactive UI**
55
- A clean, user-friendly **Gradio interface** for real-time review analysis.
 
 
 
 
 
 
 
 
 
56
 
57
  ---
58
 
59
- ## 📈 Model Development and Benchmarking
 
60
 
61
- A key part of this project was to not just build a model, but to prove its effectiveness. To do this, we followed a standard machine learning practice of establishing a strong baseline before moving to a more complex architecture.
62
 
63
- **1. The Baseline Model**
64
- We first trained a classic `Multinomial Naive Bayes` model. The text was vectorized using `TF-IDF`, and we performed a hyperparameter grid search to find the optimal settings. This approach is fast, interpretable, and provides a strong benchmark for text classification.
65
 
66
- ![Results1](assets/confusion_bay.png)
67
 
68
- - **Baseline Accuracy: 86.56%**
 
 
 
 
 
 
 
 
69
 
70
- **2. The Fine-Tuned Transformer**
71
- Next, we fine-tuned a `DistilBERT` model, a smaller and more efficient variant of BERT. By leveraging its pre-trained understanding of the English language and fine-tuning it on our specific Amazon reviews data, we aimed to capture more of the nuance and context that the baseline model might miss.
72
 
73
- ![Results2](assets/confusion_bert.png)
 
74
 
75
- - **Fine-Tuned Accuracy: ~95%**
76
 
77
- ![Results3](assets/wordcloud.png)
78
 
79
- **Conclusion**
80
 
81
- The significant increase in accuracy from **86.6%** to **~95%** demonstrates the power of transfer learning and the superior contextual understanding of transformer models. This performance gain justified the selection of the more computationally intensive DistilBERT model for the final application.
82
 
83
- ---
84
 
85
- ## 🧠 How It Works: The Analysis Pipeline
86
 
87
- When a user submits a review, the application processes it through these steps:
 
 
 
88
 
89
- 1. **Category Selection** User selects product type (*Phone, Book, etc.*).
90
- 2. **Aspect Extraction** Extracts nouns via POS tagging and filters with category dictionary.
91
- 3. **Aspect Sentiment Analysis** DeBERTa-based ABSA model assigns sentiment per aspect.
92
- 4. **Overall Sentiment Classification** DistilBERT model predicts POSITIVE/NEGATIVE.
93
- 5. **Summarization** DistilBART generates a short summary.
94
- 6. **Display** Results presented in the Gradio UI.
 
95
 
96
  ---
97
 
98
  ## 🔮 Future Improvements
99
 
100
- - 📂 **Batch Processing Mode** Upload CSV, get analyzed CSV output.
 
 
 
 
101
 
102
- - 📚 **Expand Aspect Dictionaries for more product categories.**
103
 
104
- - 🤖 **Advanced Aspect Extraction** Replace POS heuristics with ATE models for discovering new aspects automatically.
105
- - 💹 **Train the sentiment reviewer for longer and on more of the data for better performance.**
 
106
 
107
  ---
108
 
109
  ## ⚙️ Setup and Installation
110
 
111
- ### 1. Clone the Repository
112
 
113
  ```bash
114
- git clone https://github.com/Deathshot78/ReviewSense.git
115
  cd ReviewSense
116
  ```
117
 
118
- ### 2. Install Required Packages
119
 
120
  ```bash
121
  pip install -r requirements.txt
122
  ```
123
 
124
- ### 3. Download Dataset
125
 
126
- Download Amazon Reviews for Sentiment Analysis from [Kaggle](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews).
127
 
128
- Place `train.csv` and `test.csv` into the `data/` directory.
129
 
130
- ### 4. Preprocess Data
131
 
132
- ```bash
133
- python data_prepare.py
134
- ```
135
 
136
- ### 5. Fine-Tune Sentiment Model
137
 
138
  ```bash
139
- python train.py
140
  ```
141
 
142
- ---
143
-
144
- ## ▶️ Usage
145
 
146
- Run the main script:
147
 
148
- ```bash
149
- python main.py
150
- ```
151
 
152
  ---
153
 
154
- ## 📁 Project Structure
155
 
156
  ```bash
157
- ├── 📄 README.md # Project documentation
158
- ├── 🐍 app.py # Gradio app (UI + logic)
159
- ├── 🐍 models.py # Model definitions & inference classes
160
- ├── 🐍 train.py # Fine-tune sentiment classifier
161
- ├── 🐍 data_prepare.py # Data preprocessing script
162
163
- ├── 📄 requirements.txt # Dependencies
164
- ├── 📁 data/ # Datasets (local only)
165
- └── 📁 checkpoints/ # Saved model checkpoints
166
  ```
167
 
168
  ---
169
 
170
- ## 🛠️ Technologies and Models
171
 
172
  **Core Technologies**
173
 
174
- - Python, PyTorch, PyTorch Lightning
 
 
 
 
175
 
176
- - Gradio (UI)
177
 
178
- - Pandas, Scikit-learn
179
 
180
- - NLP Models (Hugging Face)
181
 
182
- - Sentiment Classifier: distilbert-base-uncased (fine-tuned)
 
 
 
 
 
 
 
 
183
 
184
- - Review Summarizer: sshleifer/distilbart-cnn-6-6
185
 
186
- - Aspect Extractor: vblagoje/bert-english-uncased-finetuned-pos
187
 
188
- - Aspect Analyzer: yangheng/deberta-v3-base-absa-v1.1
 
1
  ![Banner](assets/banner.png)
2
+ [![Python](https://img.shields.io/badge/Python-3.12+-blue?logo=python)](https://www.python.org/)[![LangChain](https://img.shields.io/badge/LangChain-Integration-blueviolet)](https://python.langchain.com/)[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
3
 
4
+ # 🛍️ ReviewSense v2.0: Product Review Analysis & Chatbot Engine
5
 
6
+ > *ReviewSense v2.0 expands upon the initial analysis engine, adding a powerful, conversational RAG (Retrieval-Augmented Generation) chatbot. It leverages a single, efficient LLM (Mistral 7B GGUF) to provide both deep batch analysis and interactive Q&A grounded in user-provided reviews.*
 
7
 
8
+ This project demonstrates an end-to-end workflow, integrating data processing, local LLM execution with `LlamaCpp`, vector storage with FAISS, conversational memory, intent classification, and an interactive Gradio web application.
9
 
10
+ ![demo_tab1](assets/tab1.png)
11
+ ![demo_tab2](assets/tab2.png)
12
+
13
+ You can find the web demo here ➡ [Web Demo](https://huggingface.co/spaces/Deathshot78/ReviewSense)
+ **Note:** Running this model on a CPU takes a while, so feel free to relax and grab a cup of coffee while it generates responses! ☕
15
 
16
  ---
17
 
18
  ## 📋 Table of Contents
19
 
20
+ - [📖 Overview](#-overview)
21
+ - [🚀 What's New in v2.0](#-whats-new-in-v20)
22
+ - [Key Features (v2.0)](#key-features-v20)
23
+ - [🧠 How It Works: The v2.0 Pipeline](#-how-it-works-the-v20-pipeline)
24
+ - [🔧 Challenges & Limitations](#-challenges--limitations)
25
+ - [💡 Prompt Engineering Journey](#-prompt-engineering-journey)
26
  - [🔮 Future Improvements](#-future-improvements)
27
+ - [⚙️ Setup and Installation](#️-setup-and-installation)
28
+ - [▶️ Usage](#️-usage)
29
+ - [📁 Project Structure (v2.0)](#-project-structure-v20)
30
+ - [🛠️ Technologies and Models (v2.0)](#️-technologies-and-models-v20)
31
+ - [📜 Version History](#-version-history)
32
 
33
  ---
34
 
35
  ## 📖 Overview
36
 
37
+ Building upon the foundation of ReviewSense v1.0, which focused on extracting insights like sentiment, aspects, and summaries using multiple specialized models, **Version 2.0 introduces a significant upgrade: a conversational chatbot**.
 
 
38
 
39
+ This chatbot allows users to ask specific questions about product reviews and receive answers synthesized directly from the provided text. To achieve this efficiently and enhance overall capabilities, v2.0 consolidates the architecture around a single, powerful yet locally runnable Large Language Model (Mistral 7B GGUF). This unified model now handles both the batch analysis tasks (with improved quality) and the interactive Q&A, demonstrating a modern approach to building multi-functional NLP applications.
40
 
41
  ---
42
 
43
+ ## 🚀 What's New in v2.0
44
 
45
+ Version 2.0 represents a major leap in functionality and architecture:
 
46
 
47
+ 1. **🤖 RAG Chatbot Implementation:** Added an interactive chatbot (Phase 2) that uses Retrieval-Augmented Generation (RAG) to answer user questions based on review context.
48
+ 2. **🧠 Single LLM Architecture:** Replaced the multiple specialized models (DistilBERT, DistilBART, DeBERTa, POS Tagger) from v1.0 with a single, powerful Mistral 7B GGUF model, executed locally via `LlamaCpp`. This model now handles:
49
+ * Batch Analysis (Summary, Aspects, Sentiment - Phase 1) with higher quality.
50
+ * RAG-based Question Answering (Phase 2).
51
+ * Intent Classification (Guardrail for Phase 2).
52
+ 3. **📄 Dynamic Context Management:** The chatbot can now operate on a default set of reviews or dynamically update its knowledge base using user-uploaded `.txt` or `.csv` files.
53
+ 4. **💬 Conversational Memory:** Integrated LangChain's `ConversationBufferMemory`, allowing the chatbot to understand follow-up questions.
54
+ 5. **🛡️ Intent Classification Guardrail:** Implemented a robust intent classifier (using the same LLM) to prevent the chatbot from answering off-topic questions, ensuring responses stay grounded in product reviews.
55
+ 6. **🖥️ Unified Gradio UI:** Developed a two-tab Gradio interface (`app.py`) providing access to both the Batch Analyzer and the RAG Chatbot in a single application.
56
+ 7. **💻 Local Execution Script:** Added `main.py` for command-line execution of batch analysis or interactive chat without the Gradio UI.
57
+ 8. **🧱 Modular Code Structure:** Refactored code into `src/pipeline.py` for core logic, improving organization and maintainability.
58
+
59
+ ---
60
 
61
+ ## Key Features (v2.0)
 
62
 
63
+ Includes all features from v1.0 (now powered by Mistral 7B) **plus**:
 
64
 
65
+ - **Interactive RAG Chatbot:**
66
+ * Ask specific questions about product reviews (e.g., "How is the battery life?", "Is the app reliable?").
67
+ * Answers synthesized directly from provided review context using RAG.
68
+ * **Conversational Memory:** Understands follow-up questions ("What about the screen?").
69
+ * **Grounded Responses:** Designed to answer only based on the reviews provided.
70
+ * **Intent Guardrail:** Filters out and responds appropriately to off-topic questions.
71
+ - **Dynamic Context Loading:**
72
+ * Chatbot operates on default reviews or context loaded from user-uploaded files (`.txt`/`.csv`).
73
+ * Clear indication of the currently active context.
74
+ - **Unified LLM Backend:** All NLP tasks (analysis, Q&A, classification) handled by a single Mistral 7B GGUF model running locally.
75
+ - **Dual Interface:** Accessible via Gradio web UI (`app.py`) or command line (`main.py`).
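The dynamic context loading described above can be sketched roughly as follows. This is an illustrative assumption, not the project's actual implementation: the `load_reviews` name, the `review` column convention for `.csv` files, and the one-review-per-line convention for `.txt` files are all hypothetical.

```python
import csv
import pathlib

def load_reviews(path: str) -> list[str]:
    """Turn an uploaded .txt or .csv file into a list of review strings."""
    p = pathlib.Path(path)
    if p.suffix.lower() == ".csv":
        with p.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        if not rows:
            return []
        # Prefer a 'review' column if present, else fall back to the first column.
        col = "review" if "review" in rows[0] else next(iter(rows[0]))
        return [row[col].strip() for row in rows if row.get(col)]
    # .txt: treat each non-empty line as one review.
    return [line.strip() for line in p.read_text(encoding="utf-8").splitlines() if line.strip()]
```

The resulting list would then be embedded and indexed into the FAISS store that backs the chatbot.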
76
 
77
  ---
78
 
79
+ ## 🧠 How It Works: The v2.0 Pipeline
80
+
81
+ **Phase 1: Batch Analysis (via `analyze_reviews_only` or `analyze_reviews_logic`)**
82
+ 1. User provides review text (paste or file).
83
+ 2. The text is passed to the Mistral LLM using three distinct prompts (Summarization, Aspect Extraction, Sentiment Analysis).
84
+ 3. The LLM generates the three analysis outputs.
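In code, the Phase 1 fan-out looks roughly like the sketch below. The prompt wording and function name are illustrative assumptions, and a stub stands in for the real `LlamaCpp` call so the example runs without the 4 GB model.

```python
# Illustrative task prompts in Mistral's [INST] format (wording assumed).
SUMMARY_PROMPT = "[INST] Summarize these reviews in 2-3 sentences:\n{reviews} [/INST]"
ASPECT_PROMPT = "[INST] List the product aspects mentioned in these reviews:\n{reviews} [/INST]"
SENTIMENT_PROMPT = "[INST] Classify the overall sentiment as Positive, Negative, or Mixed:\n{reviews} [/INST]"

def analyze_reviews(reviews: str, llm) -> dict:
    """Run the same review text through the three analysis prompts (Phase 1)."""
    return {
        "summary": llm(SUMMARY_PROMPT.format(reviews=reviews)),
        "aspects": llm(ASPECT_PROMPT.format(reviews=reviews)),
        "sentiment": llm(SENTIMENT_PROMPT.format(reviews=reviews)),
    }

# Stub LLM so the sketch is runnable; the real app calls the Mistral GGUF model.
stub_llm = lambda prompt: "(model output for) " + prompt.splitlines()[0]
result = analyze_reviews("Great battery, weak speakers.", stub_llm)
print(sorted(result))
```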
85
+
86
+ **Phase 2: RAG Chatbot (via `ask_question_with_guardrail` or `get_chatbot_response`)**
87
+ 1. User asks a question.
88
+ 2. **Intent Classification:** The query is first sent to the Mistral LLM with the `intent_prompt` (few-shot) to classify it as "Product" or "Off-Topic". Robust parsing checks the LLM output.
89
+ 3. **Routing:**
90
+ * If "Off-Topic", a canned response is returned.
91
+ * If "Product", proceed to RAG.
92
+ 4. **Context Retrieval:** The user's question is used to query the current FAISS vector store (containing embeddings of the active review context) to retrieve the top `k` relevant review snippets.
93
+ 5. **Conversational Chain Execution (`ConversationalRetrievalChain`):**
94
+ * **Condense Question:** If there's chat history, the LLM uses `CONDENSE_QUESTION_PROMPT` to rephrase the current question into a standalone query.
95
+ * **RAG Generation:** The condensed question and retrieved context snippets are passed to the LLM with the strict `qa_prompt`. The LLM synthesizes an answer based *only* on the provided context.
96
+ * **Memory Update:** The question and final answer are added to the `ConversationBufferMemory`.
97
+ 6. **Response:** The synthesized answer is returned to the user.
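Step 4 (context retrieval) can be illustrated with a toy ranking function. The real pipeline uses sentence-transformer embeddings in a FAISS index; the bag-of-words cosine similarity below is only a stand-in so the sketch runs without any model downloads.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for MiniLM vectors)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, reviews: list[str], k: int = 2) -> list[str]:
    """Return the top-k review snippets most similar to the query."""
    q = embed(query)
    return sorted(reviews, key=lambda r: cosine(q, embed(r)), reverse=True)[:k]

reviews = [
    "The battery life is excellent, lasting all day.",
    "Screen brightness is disappointing outdoors.",
    "Battery life drains fast when gaming.",
]
print(retrieve("How is the battery life?", reviews))
```

FAISS performs the same nearest-neighbour ranking, just over dense embedding vectors and at much larger scale.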
98
 
99
+ ---
100
 
101
+ ## 🔧 Challenges & Limitations
 
102
 
103
+ Developing v2.0 involved significant experimentation and revealed several challenges:
104
 
105
+ 1. **Consistent Instruction Following:** While powerful, the Mistral 7B GGUF model sometimes struggled to consistently follow complex negative constraints or nuanced instructions in prompts, especially within the RAG chain. This led to:
106
+ * **Context Leakage:** Occasionally including irrelevant details from retrieved chunks (e.g., mentioning webcam when asked about battery).
107
+ * **Hallucination:** Making up information (e.g., mentioning "phone" for laptop battery, inventing prices or product names).
108
+ * **Over-Cautiousness:** Incorrectly stating "cannot find information" even when relevant details were present in the context, particularly for negative aspects (e.g., hardware issues).
109
+ * **Misinterpretation:** Failing to correctly understand the specific user question (e.g., "taste" vs. "type", comparison questions).
110
+ 2. **Prompt Engineering Complexity:** Finding the right prompt structure required extensive iteration. Simple prompts lacked control, while overly complex prompts sometimes confused the model. Few-shot prompting proved essential for reliable intent classification. Balancing strictness (for grounding) with flexibility (to allow synthesis) in the RAG prompt was difficult.
111
+ 3. **Intent Classification Brittleness:** Getting the LLM to output *only* the classification label required moving from zero-shot, to strict instructions, to few-shot examples, and finally adding robust parsing logic (`parse_intent`) to handle noisy LLM outputs reliably.
112
+ 4. **Performance:** Running the 7B parameter GGUF model on a CPU is significantly slower than using smaller models or GPU acceleration. Batch analysis and RAG responses take noticeable time (though acceptable for demonstration).
113
+ 5. **Evaluation Bottleneck:** Using external APIs (like OpenAI) for RAGAs evaluation can incur costs and hit rate limits. Using the local model for evaluation is free but slower and potentially less objective.
114
 
115
+ ---
 
116
 
117
+ ## 💡 Prompt Engineering Journey
118
+
119
+ Achieving the final, relatively stable performance required significant iteration on the prompts, particularly for the RAG chain (`qa_prompt`) and intent classification (`intent_prompt`).
120
+
121
+ **Intent Classification (`intent_prompt`):**
122
+
123
+ * Initial attempts with simple zero-shot prompts failed, with the model providing verbose, incorrect classifications.
124
+ * Adding strict formatting rules (`MUST BE EXACTLY...`) helped but wasn't sufficient.
125
+ * **Few-Shot Prompting** (providing explicit examples within the prompt) proved crucial for forcing the model to output the correct labels, although often with extra text.
126
+ * **Robust Parsing (`parse_intent`)** was added to reliably extract the core "Product" or "Off-Topic" keyword from the model's potentially noisy output.
127
+
128
+ **Final `intent_template`:**
129
+
130
+ ```python
131
+ intent_template = """
132
+ [INST]
133
+ **CRITICAL INSTRUCTION:** Classify the user's query into ONLY ONE of two categories: "Product" or "Off-Topic".
134
+ Your response MUST be EXACTLY "Product" or EXACTLY "Off-Topic".
135
+
136
+ **EXAMPLES:**
137
+ Query: How is the battery life?
138
+ Classification: Product
139
+ Query: What are the complaints about the screen?
140
+ Classification: Product
141
+ Query: Does it come in blue?
142
+ Classification: Product
143
+ Query: What is the capital of France?
144
+ Classification: Off-Topic
145
+ Query: Hello there
146
+ Classification: Off-Topic
147
+ Query: Who are you?
148
+ Classification: Off-Topic
149
+
150
+ **NOW CLASSIFY THIS QUERY:**
151
+ Query: {query}
152
+ [/INST]
153
+ Classification:"""
154
+ ```
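The robust parsing step mentioned above might look something like this sketch. The actual `parse_intent` lives in `src/pipeline.py`; this version, along with the `ask_with_guardrail` router, is an illustrative assumption.

```python
def parse_intent(raw_output: str) -> str:
    """Extract 'Product' or 'Off-Topic' from a potentially noisy LLM completion."""
    text = raw_output.strip().lower()
    # Check "off-topic" first: an off-topic verdict may still mention "product".
    if "off-topic" in text or "off topic" in text:
        return "Off-Topic"
    if "product" in text:
        return "Product"
    # Fail safe: refuse rather than answer an unclassifiable query.
    return "Off-Topic"

def ask_with_guardrail(query: str, classify, answer) -> str:
    """Route a query: canned reply for off-topic, otherwise run the RAG chain."""
    if parse_intent(classify(query)) == "Off-Topic":
        return "I can only answer questions about the product reviews."
    return answer(query)
```

The fail-safe default matters: when the model emits something unparseable, refusing is cheaper than hallucinating an answer.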
155
 
156
+ **RAG Generation (`qa_system_prompt`):**
157
 
158
+ * Initial simple prompts led to significant hallucination and context leakage.
159
 
160
+ * Adding strict rules improved grounding but sometimes made the model overly cautious, failing to find information present in the context.
161
 
162
+ * Explicitly addressing failure modes (like comparisons) helped for those specific cases.
163
 
164
+ * Experimenting with different chain types (`stuff`, `map_reduce`, `refine`) showed limitations related to context window size and model instruction following. `stuff` with `ConversationalRetrievalChain` proved most practical.
165
 
166
+ **Final `qa_system_prompt` (within `qa_prompt`):**
167
 
168
+ ```python
169
+ # RAG System Prompt (qa_system_prompt)
170
+ qa_system_prompt = """[INST]You are a factual assistant providing answers based **only** on the customer reviews provided.
171
+ Your task is to answer the user's question concisely using information explicitly found in the 'CONTEXT' snippets below.
172
 
173
+ **CRITICAL RULES TO FOLLOW:**
174
+ 1. **STRICTLY Contextual:** Base your answer ENTIRELY and ONLY on the information within the 'CONTEXT' section. Do NOT use any prior knowledge or external information.
175
+ 2. **Direct & Relevant:** Answer ONLY the specific question asked. Do NOT include details from the context that are irrelevant to the question, even if they appear nearby.
176
+ 3. **Synthesize Concisely:** Combine relevant facts from potentially multiple snippets into a brief answer (usually 1-3 sentences). Do NOT quote long passages unless absolutely necessary.
177
+ 4. **No Comparisons Outside Context:** If the question asks to compare the product to something *not mentioned* in the reviews, state *only*: "Cannot compare based on the provided reviews." Do not add details about the product itself in this case.
178
+ 5. **Handle Missing Info Carefully:** If, after carefully reading the context, you genuinely cannot find any information relevant to the question, state *only*: "Based on the provided reviews, I cannot find information about that." Check thoroughly before using this response.
179
+ 6. **Factual Tone:** Do NOT apologize, express opinions, make recommendations, or use conversational filler. Just state the facts found in the reviews.
180
+
181
+ CONTEXT:
182
+ ---
183
+ {context}
184
+ ---
185
+
186
+ QUESTION: {question} [/INST]
187
+ Answer:"""
188
+ ```
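To show what the `stuff` strategy does with this prompt, here is a simplified, assumed version of the assembly step inside `ConversationalRetrievalChain`: all retrieved snippets are concatenated into the `{context}` slot before a single LLM call. The shortened template is a stand-in, not the exact prompt above.

```python
# Shortened stand-in for the qa prompt (assumed, not the exact template).
QA_TEMPLATE = (
    "[INST]Answer using ONLY the context below.\n"
    "CONTEXT:\n---\n{context}\n---\n\n"
    "QUESTION: {question} [/INST]\nAnswer:"
)

def stuff_prompt(question: str, snippets: list[str]) -> str:
    """'Stuff' all retrieved snippets into one context block (single LLM call)."""
    return QA_TEMPLATE.format(context="\n".join(snippets), question=question)

prompt = stuff_prompt("How is the battery?", ["Battery lasts all day.", "Charges slowly."])
print(prompt)
```

By contrast, `map_reduce` would call the LLM once per snippet and then combine the partial answers, which proved slower and harder for the model to follow here.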
189
+
190
+ This iterative process demonstrates the practical challenges and refinement needed when working with local LLMs in complex pipelines.
191
 
192
  ---
193
 
194
  ## 🔮 Future Improvements
195
 
196
+ * **RAG Evaluation**: Fully implement and integrate RAGAs (or TruLens) evaluation using the local LLM or a free tier API to get quantitative metrics on Faithfulness, Answer Relevancy, etc.
197
+
198
+ * **LLM Upgrade**: Experiment with larger or more advanced instruction-tuned models (e.g., Mixtral GGUF, Llama 3 70/8B Instruct GGUF, or API-based models like GPT-4/Claude 3) to achieve higher consistency in instruction following and synthesis.
199
+
200
+ * **Advanced Retrieval**: Explore more sophisticated retrieval techniques (e.g., HyDE, MultiQueryRetriever, Re-ranking) to improve the quality of context chunks passed to the LLM, potentially reducing generation errors.
201
 
202
+ * **Batch Processing for Analysis**: Re-implement batch processing for Phase 1 using techniques like `map_reduce` to handle large numbers of reviews that exceed the LLM's context window.
203
 
204
+ * **Error Handling & UI**: Add more granular error handling and user feedback in the Gradio UI (e.g., clearer messages if context loading fails).
205
+
206
+ * **Automated Testing**: Implement unit and integration tests using `pytest` for the core logic in `src/pipeline.py`.
207
 
208
  ---
209
 
210
  ## ⚙️ Setup and Installation
211
 
212
+ **1. Clone the Repository**
213
 
214
  ```bash
215
+ git clone https://github.com/Deathshot78/ReviewSense.git
216
  cd ReviewSense
217
  ```
218
 
219
+ **2. Install Required Packages**
220
 
221
  ```bash
222
  pip install -r requirements.txt
223
  ```
224
 
225
+ **3. Download LLM Model**
226
 
227
+ The scripts will attempt to download the Mistral-7B GGUF model (`mistral-7b-instruct-v0.1.Q4_K_M.gguf`, ~4.4GB) automatically via `wget` on the first run if it's not found in the root directory. You can also download it manually from [Hugging Face](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF) and place it in the project root.
228
 
229
+ ---
230
 
231
+ ## ▶️ Usage
232
 
233
+ **Web App (Gradio)**
 
 
234
 
235
+ Run the Gradio app:
236
 
237
  ```bash
238
+ python app.py
239
  ```
240
 
241
+ Access the interface in your browser:
 
 
242
 
243
+ * **Tab 1 ("Batch Analyzer"):** Paste reviews or upload a file to perform Summary, Aspect Extraction, and Sentiment Analysis. This does not affect the chatbot context.
244
 
245
+ * **Tab 2 ("Ask a Question"):** Chat with the RAG bot. Use the file upload and "Update Chatbot Context" button within this tab to change the reviews the chatbot uses. Use "Reset Chatbot Context to Default" to revert to the built-in laptop reviews. Use "Reset Chat Memory" to clear the conversation history.
 
 
246
 
247
  ---
248
 
249
+ ## 📁 Project Structure (v2.0)
250
 
251
  ```bash
252
+ ReviewSense/
+ ├── 📄 README.md         # Project documentation
+ ├── 📁 src/              # Core source code
+ │   └── 📄 pipeline.py   # Core functions for analysis, RAG, etc.
+ ├── 📄 app.py            # Gradio web application
+ ├── 📄 main.py           # Command-line execution
+ ├── 📄 requirements.txt  # Python dependencies
+ ├── 📄 .gitignore        # Files ignored by Git
+ └── 🖼️ assets/           # Images
261
  ```
262
 
263
  ---
264
 
265
+ ## 🛠️ Technologies and Models (v2.0)
266
 
267
  **Core Technologies**
268
 
269
+ * Python 3.10+
270
+
271
+ * LangChain: Orchestration, Chains (ConversationalRetrievalChain), Memory, Prompts
272
+
273
+ * llama-cpp-python: Local execution of GGUF models on CPU
274
 
275
+ * FAISS (faiss-cpu): Efficient vector similarity search
276
 
277
+ * Sentence-Transformers (all-MiniLM-L6-v2): Text embeddings
278
 
279
+ * Gradio: Interactive web UI
280
 
281
+ * PyTorch (dependency via transformers/sentence-transformers)
282
+
283
+ * Pandas, NumPy (standard data handling)
284
+
285
+ **Core LLM**
286
+
287
+ * Mistral 7B Instruct v0.1 (GGUF Q4_K_M): Used for all NLP tasks (Analysis, RAG Generation, Intent Classification). Downloaded from TheBloke on Hugging Face.
288
+
289
+ ---
290
 
291
+ ## 📜 Version History
292
 
293
+ * v2.0 (Current): RAG Chatbot, Single Mistral 7B model, Dynamic Context, Memory, Guardrails, Gradio UI, Code Refactoring.
294
 
295
+ * v1.0: Initial Batch Analysis Engine using multiple specialized models (DistilBERT, DistilBART, etc.), focused on Sentiment, Aspects, and Summarization. See the [v1.0 release](https://github.com/Deathshot78/ReviewSense/releases/tag/v1.0) and its README for full details.
assets/{confusion_bay.png → tab1.PNG} RENAMED
File without changes
assets/{confusion_bert.png → tab2.PNG} RENAMED
File without changes
assets/wordcloud.png DELETED

Git LFS Details

  • SHA256: 126af3f40232b435991a19a85329d4ed8bb25f239b393748c82bdd69c423082c
  • Pointer size: 130 Bytes
  • Size of remote file: 63.5 kB
checkpoints/sentiment-binary-best-checkpoint.ckpt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b80c0f5882524f5859ac6a92f3311a2d8d4638bd6ef1236232fbb32057b43f3d
3
- size 803593979
 
 
 
 
notebooks/reviewsense.ipynb DELETED
@@ -1,1001 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "id": "1754f3bb",
6
- "metadata": {},
7
- "source": [
8
- "# 🛍️ ReviewSense: Product Review Analysis Engine\n",
9
- "\n",
10
- "> *ReviewSense is a comprehensive, end-to-end Natural Language Processing application built to extract deep, actionable insights from unstructured product reviews.* \n",
11
- "Where a simple star rating only tells part of the story, ReviewSense dives into the text to uncover what customers are saying, why they're saying it, and how they feel about specific product features. "
12
- ]
13
- },
14
- {
15
- "cell_type": "markdown",
16
- "id": "00d383d6",
17
- "metadata": {},
18
- "source": [
19
- "## Imports"
20
- ]
21
- },
22
- {
23
- "cell_type": "code",
24
- "execution_count": null,
25
- "id": "4d48ba17",
26
- "metadata": {},
27
- "outputs": [],
28
- "source": [
29
- "import pytorch_lightning as pl\n",
30
- "from torch.utils.data import DataLoader, Dataset\n",
31
- "from transformers import AutoTokenizer\n",
32
- "import pandas as pd\n",
33
- "from sklearn.model_selection import train_test_split\n",
34
- "import torch\n",
35
- "import os\n",
36
- "import numpy as np\n",
37
- "from sklearn.model_selection import train_test_split, ParameterGrid, StratifiedKFold\n",
38
- "from sklearn.feature_extraction.text import TfidfVectorizer\n",
39
- "from sklearn.naive_bayes import MultinomialNB\n",
40
- "from sklearn.pipeline import Pipeline\n",
41
- "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n",
42
- "import seaborn as sns\n",
43
- "import matplotlib.pyplot as plt\n",
44
- "from tqdm.notebook import tqdm\n",
45
- "from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping\n",
46
- "from pytorch_lightning.loggers import TensorBoardLogger\n",
47
- "from transformers import T5ForConditionalGeneration, T5Tokenizer\n",
48
- "from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup, AutoConfig\n",
49
- "from torch.optim import AdamW\n",
50
- "import torch\n",
51
- "from torchmetrics.functional import accuracy\n",
52
- "from transformers import T5ForConditionalGeneration, T5Tokenizer, AutoTokenizer, pipeline\n",
53
- "\n"
54
- ]
55
- },
56
- {
57
- "cell_type": "markdown",
58
- "id": "8263bc02",
59
- "metadata": {},
60
- "source": [
61
- "## Prepare the data"
62
- ]
63
- },
64
- {
65
- "cell_type": "code",
66
- "execution_count": null,
67
- "id": "a5f8dcda",
68
- "metadata": {},
69
- "outputs": [],
70
- "source": [
71
- "def explore_and_preprocess_reviews(\n",
72
- " train_path='data/train.csv', \n",
73
- " test_path='data/test.csv',\n",
74
- " output_dir='data'\n",
75
- "):\n",
76
- " \"\"\"\n",
77
- " Loads the Amazon Sentiment Analysis dataset (https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews)\n",
78
- " (you need to extract the train/test splits from the zip file in the data folder),\n",
79
- " performs basic EDA, and preprocesses it for model training.\n",
80
- "\n",
81
- " Args:\n",
82
- " train_path (str): Path to the training CSV file.\n",
83
- " test_path (str): Path to the testing CSV file.\n",
84
- " output_dir (str): Directory to save the processed file.\n",
85
- " \"\"\"\n",
86
- " # --- 1. Load Data ---\n",
87
- " # This dataset typically comes without headers. We'll assign them.\n",
88
- " # Column 1: Sentiment (1 = Negative, 2 = Positive)\n",
89
- " # Column 2: Title\n",
90
- " # Column 3: Review Text\n",
91
- " print(f\"Loading data from '{train_path}' and '{test_path}'...\")\n",
92
- " try:\n",
93
- " col_names = ['sentiment_orig', 'title', 'review']\n",
94
- " train_df = pd.read_csv(train_path, header=None, names=col_names)\n",
95
- " test_df = pd.read_csv(test_path, header=None, names=col_names)\n",
96
- " \n",
97
- " # Combine for unified EDA and preprocessing\n",
98
- " df = pd.concat([train_df, test_df], ignore_index=True)\n",
99
- "\n",
100
- " except FileNotFoundError:\n",
101
- " print(f\"\\nERROR: Make sure '{train_path}' and '{test_path}' are in the specified directory.\")\n",
102
- " print(\"This script is designed for the 'Amazon Reviews for Sentiment Analysis' dataset from Kaggle.\")\n",
103
- " return\n",
104
- "\n",
105
- " df.dropna(inplace=True)\n",
106
- "\n",
107
- " # --- 2. Preprocessing ---\n",
108
- " print(\"\\n--- Preprocessing Data for Sentiment Analysis ---\")\n",
109
- "\n",
110
- " # a) Create new sentiment labels (0 = Negative, 1 = Positive)\n",
111
- " # This dataset is binary, not three-class like the previous one.\n",
112
- " df['sentiment'] = df['sentiment_orig'].apply(lambda x: 0 if x == 1 else 1)\n",
113
- "\n",
114
- " # b) Combine title and review body\n",
115
- " df['full_text'] = df['title'].astype(str) + \". \" + df['review'].astype(str)\n",
116
- "\n",
117
- " # c) Select and rename columns\n",
118
- " processed_df = df[['full_text', 'sentiment']].copy()\n",
119
- "\n",
120
- " # --- 4. Save Processed Data ---\n",
121
- "    os.makedirs(output_dir, exist_ok=True)\n",
- "    output_path = os.path.join(output_dir, 'reviews_processed.csv')\n",
- "    processed_df.to_csv(output_path, index=False)\n",
- "    print(f\"\\nSaved {len(processed_df)} processed reviews to '{output_path}'\")\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "60ab838c",
- "metadata": {},
- "outputs": [],
- "source": [
- "#--- Preprocess the Reviews Dataset ---\n",
- "print(\"\\n--- Preprocessing started ---\")\n",
- "explore_and_preprocess_reviews()\n",
- "print(\"\\n--- Preprocessing finished ---\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4c381d73",
- "metadata": {},
- "source": [
- "## Define a baseline model (Multinomial Naive Bayes)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b3cd2b5b",
- "metadata": {},
- "outputs": [],
- "source": [
- "def train_baseline_sentiment_model(data_path='data/reviews_processed.csv', grid_search=True, nb__alpha=0.1, tfidf__max_df=0.75, tfidf__ngram_range=(1, 2), sample_size: int = 50000):\n",
- "    \"\"\"\n",
- "    Trains and evaluates a Multinomial Naive Bayes model for sentiment analysis.\n",
- "    Can optionally perform a grid search.\n",
- "\n",
- "    Args:\n",
- "        data_path (str): Path to the processed reviews CSV file.\n",
- "        grid_search (bool): If True, performs a grid search.\n",
- "        nb__alpha (float): Alpha for MultinomialNB.\n",
- "        tfidf__max_df (float): max_df for TfidfVectorizer.\n",
- "        tfidf__ngram_range (tuple): ngram_range for TfidfVectorizer.\n",
- "        sample_size (int, optional): Number of reviews to use. If None, uses all.\n",
- "    \"\"\"\n",
- "    # --- 1. Load Data ---\n",
- "    print(f\"Loading data from '{data_path}'...\")\n",
- "    if not os.path.exists(data_path):\n",
- "        print(f\"\\nERROR: '{data_path}' not found. Please run the EDA script first!\")\n",
- "        return\n",
- "    \n",
- "    df = pd.read_csv(data_path)\n",
- "    df.dropna(inplace=True)\n",
- "\n",
- "    # --- 2. Sample Data ---\n",
- "    if sample_size:\n",
- "        print(f\"Using a sample of {sample_size} reviews for training the baseline model.\")\n",
- "        df = df.sample(n=sample_size, random_state=42)\n",
- "\n",
- "    # --- 3. Train-Test Split ---\n",
- "    print(\"Splitting data into training and testing sets...\")\n",
- "    X_train, X_test, y_train, y_test = train_test_split(\n",
- "        df['full_text'],\n",
- "        df['sentiment'],\n",
- "        test_size=0.2,\n",
- "        random_state=42,\n",
- "        stratify=df['sentiment']\n",
- "    )\n",
- "\n",
- "    # --- 4. Create a Pipeline ---\n",
- "    pipeline = Pipeline([\n",
- "        ('tfidf', TfidfVectorizer(stop_words='english')),\n",
- "        ('nb', MultinomialNB()),\n",
- "    ])\n",
- "\n",
- "    best_params = None\n",
- "\n",
- "    if grid_search:\n",
- "        # --- 5a. Perform Grid Search ---\n",
- "        print(\"Performing Grid Search to find the best hyperparameters...\")\n",
- "        parameters = {\n",
- "            'tfidf__ngram_range': [(1, 1), (1, 2)],\n",
- "            'tfidf__max_df': [0.5, 0.75, 1.0],\n",
- "            'nb__alpha': [0.1, 0.5, 1.0],\n",
- "        }\n",
- "        param_grid = list(ParameterGrid(parameters))\n",
- "        best_score = -1\n",
- "\n",
- "        for params in tqdm(param_grid, desc=\"Grid Search Progress\"):\n",
- "            pipeline.set_params(**params)\n",
- "            pipeline.fit(X_train, y_train)\n",
- "            score = pipeline.score(X_test, y_test)\n",
- "            if score > best_score:\n",
- "                best_score = score\n",
- "                best_params = params\n",
- "        \n",
- "        print(f\"\\nBest score on test set: {best_score:.4f}\")\n",
- "        print(\"Best parameters found:\")\n",
- "        print(best_params)\n",
- "\n",
- "    else:\n",
- "        # --- 5b. Use provided hyperparameters ---\n",
- "        print(\"Skipping grid search and using provided hyperparameters...\")\n",
- "        best_params = {\n",
- "            'nb__alpha': nb__alpha,\n",
- "            'tfidf__max_df': tfidf__max_df,\n",
- "            'tfidf__ngram_range': tfidf__ngram_range\n",
- "        }\n",
- "\n",
- "    # --- 6. Train the Final Model ---\n",
- "    print(\"\\nTraining final model...\")\n",
- "    best_model = pipeline.set_params(**best_params)\n",
- "    best_model.fit(X_train, y_train)\n",
- "    print(\"Model training complete.\")\n",
- "\n",
- "    # --- 7. Evaluate the Best Model ---\n",
- "    print(\"\\n--- Model Evaluation ---\")\n",
- "    y_pred = best_model.predict(X_test)\n",
- "    \n",
- "    accuracy = accuracy_score(y_test, y_pred)\n",
- "    target_names = ['Negative', 'Positive']\n",
- "    \n",
- "    print(f\"Accuracy: {accuracy:.4f}\")\n",
- "    print(\"\\nClassification Report:\")\n",
- "    print(classification_report(y_test, y_pred, target_names=target_names))\n",
- "    \n",
- "    print(\"Confusion Matrix:\")\n",
- "    cm = confusion_matrix(y_test, y_pred)\n",
- "    plt.figure(figsize=(8, 6))\n",
- "    sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', \n",
- "                xticklabels=target_names, yticklabels=target_names)\n",
- "    plt.title('Confusion Matrix for Naive Bayes on Amazon Reviews')\n",
- "    plt.xlabel('Predicted Label')\n",
- "    plt.ylabel('True Label')\n",
- "    plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "093e6ae9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# --- Train the baseline model ---\n",
- "train_baseline_sentiment_model(sample_size=150000, grid_search=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "71f5e4ba",
- "metadata": {},
- "source": [
- "## Define the Dataset and Lightning DataModule"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c977e0f4",
- "metadata": {},
- "outputs": [],
- "source": [
- "class ReviewDataset(Dataset):\n",
- "    \"\"\"\n",
- "    Custom PyTorch Dataset for Amazon Reviews.\n",
- "\n",
- "    This class takes a pandas DataFrame of review data, a tokenizer, and a max\n",
- "    token length, and prepares it for use in a PyTorch model. It handles the\n",
- "    tokenization of the text and the formatting of the labels for each item.\n",
- "\n",
- "    Attributes:\n",
- "        tokenizer: The Hugging Face tokenizer to use for processing text.\n",
- "        data (pd.DataFrame): The DataFrame containing the review data.\n",
- "        max_token_len (int): The maximum sequence length for the tokenizer.\n",
- "    \"\"\"\n",
- "    def __init__(self, data: pd.DataFrame, tokenizer, max_token_len: int):\n",
- "        \"\"\"\n",
- "        Initializes the ReviewDataset.\n",
- "\n",
- "        Args:\n",
- "            data (pd.DataFrame): The input DataFrame containing 'full_text' and\n",
- "                'sentiment' columns.\n",
- "            tokenizer: The pre-trained tokenizer instance.\n",
- "            max_token_len (int): The maximum length for tokenized sequences.\n",
- "        \"\"\"\n",
- "        self.tokenizer = tokenizer\n",
- "        self.data = data\n",
- "        self.max_token_len = max_token_len\n",
- "\n",
- "    def __len__(self):\n",
- "        \"\"\"\n",
- "        Returns the total number of samples in the dataset.\n",
- "        \"\"\"\n",
- "        return len(self.data)\n",
- "\n",
- "    def __getitem__(self, index: int):\n",
- "        \"\"\"\n",
- "        Retrieves one sample from the dataset at the specified index.\n",
- "\n",
- "        This method handles the tokenization of a single review text, including\n",
- "        padding and truncation, and formats the output into a dictionary of\n",
- "        tensors ready for the model.\n",
- "\n",
- "        Args:\n",
- "            index (int): The index of the data sample to retrieve.\n",
- "\n",
- "        Returns:\n",
- "            dict: A dictionary containing the tokenized inputs and the label,\n",
- "                with the following keys:\n",
- "                - 'input_ids': The token IDs of the review text.\n",
- "                - 'attention_mask': The attention mask for the review text.\n",
- "                - 'labels': The sentiment label as a tensor.\n",
- "        \"\"\"\n",
- "        data_row = self.data.iloc[index]\n",
- "        text = str(data_row.full_text)\n",
- "        labels = data_row.sentiment\n",
- "\n",
- "        encoding = self.tokenizer.encode_plus(\n",
- "            text,\n",
- "            add_special_tokens=True,\n",
- "            max_length=self.max_token_len,\n",
- "            return_token_type_ids=False,\n",
- "            padding=\"max_length\",\n",
- "            truncation=True,\n",
- "            return_attention_mask=True,\n",
- "            return_tensors='pt',\n",
- "        )\n",
- "\n",
- "        return dict(\n",
- "            input_ids=encoding[\"input_ids\"].flatten(),\n",
- "            attention_mask=encoding[\"attention_mask\"].flatten(),\n",
- "            labels=torch.tensor(labels, dtype=torch.long)\n",
- "        )\n",
- "\n",
- "class ReviewDataModule(pl.LightningDataModule):\n",
- "    \"\"\"\n",
- "    PyTorch Lightning DataModule to handle the Amazon Reviews dataset.\n",
- "\n",
- "    This class encapsulates all the steps needed to process the data:\n",
- "    loading, splitting, and creating PyTorch DataLoaders for training,\n",
- "    validation, and testing. It allows for using a smaller random sample of the\n",
- "    full dataset for faster experimentation.\n",
- "\n",
- "    Attributes:\n",
- "        data_path (str): Path to the processed CSV file.\n",
- "        batch_size (int): The size of each data batch.\n",
- "        max_token_len (int): The maximum sequence length for the tokenizer.\n",
- "        tokenizer: The Hugging Face tokenizer instance.\n",
- "        num_workers (int): The number of CPU cores to use for data loading.\n",
- "        sample_size (int, optional): The number of samples to use. If None,\n",
- "            the full dataset is used.\n",
- "    \"\"\"\n",
- "    def __init__(self, data_path: str, batch_size: int = 16, max_token_len: int = 256, model_name='distilbert-base-uncased', num_workers: int = 0, sample_size: int = None):\n",
- "        \"\"\"\n",
- "        Initializes the ReviewDataModule.\n",
- "\n",
- "        Args:\n",
- "            data_path (str): The path to the processed CSV data file.\n",
- "            batch_size (int): The number of samples per batch.\n",
- "            max_token_len (int): Maximum length of tokenized sequences.\n",
- "            model_name (str): The name of the pre-trained model to use for the tokenizer.\n",
- "            num_workers (int): Number of subprocesses to use for data loading.\n",
- "            sample_size (int, optional): If specified, a random sample of this\n",
- "                size will be used from the dataset.\n",
- "                Defaults to None, which uses the full dataset.\n",
- "        \"\"\"\n",
- "        super().__init__()\n",
- "        self.data_path = data_path\n",
- "        self.batch_size = batch_size\n",
- "        self.max_token_len = max_token_len\n",
- "        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
- "        self.num_workers = num_workers\n",
- "        self.sample_size = sample_size\n",
- "        self.train_df = None\n",
- "        self.val_df = None\n",
- "        self.test_df = None\n",
- "\n",
- "    def setup(self, stage=None):\n",
- "        \"\"\"\n",
- "        Loads and splits the data for training, validation, and testing.\n",
- "\n",
- "        This method is called by PyTorch Lightning. It reads the CSV, handles\n",
- "        missing values, optionally takes a random sample, and performs a\n",
- "        stratified train-validation-test split. The indices of the resulting\n",
- "        DataFrames are reset to prevent potential KeyErrors during data loading.\n",
- "        \"\"\"\n",
- "        df = pd.read_csv(self.data_path)\n",
- "        df.dropna(inplace=True)\n",
- "\n",
- "        # If a sample size is provided, sample the dataframe\n",
- "        if self.sample_size:\n",
- "            print(f\"Using a sample of {self.sample_size} reviews.\")\n",
- "            df = df.sample(n=self.sample_size, random_state=42)\n",
- "\n",
- "        # Stratified split to maintain label distribution\n",
- "        train_val_df, self.test_df = train_test_split(df, test_size=0.1, random_state=42, stratify=df.sentiment)\n",
- "        self.train_df, self.val_df = train_test_split(train_val_df, test_size=0.1, random_state=42, stratify=train_val_df.sentiment)\n",
- "\n",
- "        # Reset indices to prevent KeyErrors\n",
- "        self.train_df = self.train_df.reset_index(drop=True)\n",
- "        self.val_df = self.val_df.reset_index(drop=True)\n",
- "        self.test_df = self.test_df.reset_index(drop=True)\n",
- "\n",
- "        print(f\"Size of training set: {len(self.train_df)}\")\n",
- "        print(f\"Size of validation set: {len(self.val_df)}\")\n",
- "        print(f\"Size of test set: {len(self.test_df)}\")\n",
- "\n",
- "    def train_dataloader(self):\n",
- "        \"\"\"Returns the DataLoader for the training set.\"\"\"\n",
- "        return DataLoader(\n",
- "            ReviewDataset(self.train_df, self.tokenizer, self.max_token_len),\n",
- "            batch_size=self.batch_size,\n",
- "            shuffle=True,\n",
- "            num_workers=self.num_workers\n",
- "        )\n",
- "\n",
- "    def val_dataloader(self):\n",
- "        \"\"\"Returns the DataLoader for the validation set.\"\"\"\n",
- "        return DataLoader(\n",
- "            ReviewDataset(self.val_df, self.tokenizer, self.max_token_len),\n",
- "            batch_size=self.batch_size,\n",
- "            num_workers=self.num_workers\n",
- "        )\n",
- "\n",
- "    def test_dataloader(self):\n",
- "        \"\"\"Returns the DataLoader for the test set.\"\"\"\n",
- "        return DataLoader(\n",
- "            ReviewDataset(self.test_df, self.tokenizer, self.max_token_len),\n",
- "            batch_size=self.batch_size,\n",
- "            num_workers=self.num_workers\n",
- "        )\n",
- "    "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "985ac47b",
- "metadata": {},
- "outputs": [],
- "source": [
- "# --- Configuration ---\n",
- "data_path = \"data/reviews_processed.csv\"\n",
- "BATCH_SIZE = 64\n",
- "MAX_TOKEN_LEN = 256\n",
- "\n",
- "print(\"Initializing ReviewDataModule...\")\n",
- "review_datamodule = ReviewDataModule(\n",
- "    data_path=data_path,\n",
- "    batch_size=BATCH_SIZE,\n",
- "    max_token_len=MAX_TOKEN_LEN,\n",
- "    model_name='distilbert-base-uncased',\n",
- "    sample_size=100000 # Pass the sample size to the datamodule\n",
- ")\n",
- "review_datamodule.setup()\n",
- "\n",
- "# Fetch one batch from the training dataloader to inspect its contents\n",
- "print(\"\\n--- Fetching one batch from the training dataloader ---\")\n",
- "train_batch = next(iter(review_datamodule.train_dataloader()))\n",
- "\n",
- "print(\"\\n--- Example Batch ---\")\n",
- "print(f\"Input IDs shape: {train_batch['input_ids'].shape}\")\n",
- "print(f\"Attention Mask shape: {train_batch['attention_mask'].shape}\")\n",
- "print(f\"Labels: {train_batch['labels']}\")\n",
- "print(f\"Labels shape: {train_batch['labels'].shape}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2c7781f4",
- "metadata": {},
- "source": [
- "## Fine-Tune DistilBERT"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d046b940",
- "metadata": {},
- "outputs": [],
- "source": [
- "class SentimentClassifier(pl.LightningModule):\n",
- "    \"\"\"\n",
- "    PyTorch Lightning module for the sentiment classification model.\n",
- "    \"\"\"\n",
- "    def __init__(self, model_name='distilbert-base-uncased', n_classes=2, learning_rate=2e-5, n_warmup_steps=0, n_training_steps=0, dropout_prob=0.2): # Added dropout\n",
- "        super().__init__()\n",
- "        self.save_hyperparameters()\n",
- "\n",
- "        # Configure dropout\n",
- "        config = AutoConfig.from_pretrained(model_name)\n",
- "        config.hidden_dropout_prob = dropout_prob\n",
- "        config.attention_probs_dropout_prob = dropout_prob\n",
- "        config.num_labels = n_classes\n",
- "\n",
- "        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)\n",
- "\n",
- "    def forward(self, input_ids, attention_mask, labels=None):\n",
- "        return self.model(\n",
- "            input_ids=input_ids,\n",
- "            attention_mask=attention_mask,\n",
- "            labels=labels\n",
- "        )\n",
- "\n",
- "    def training_step(self, batch, batch_idx):\n",
- "        output = self.forward(**batch)\n",
- "        self.log(\"train_loss\", output.loss, prog_bar=True, logger=True)\n",
- "        return output.loss\n",
- "\n",
- "    def validation_step(self, batch, batch_idx):\n",
- "        output = self.forward(**batch)\n",
- "        preds = torch.argmax(output.logits, dim=1)\n",
- "        val_acc = accuracy(preds, batch['labels'], task='binary')\n",
- "        self.log(\"val_loss\", output.loss, prog_bar=True, logger=True)\n",
- "        self.log(\"val_accuracy\", val_acc, prog_bar=True, logger=True)\n",
- "        return output.loss\n",
- "\n",
- "    def test_step(self, batch, batch_idx):\n",
- "        output = self.forward(**batch)\n",
- "        preds = torch.argmax(output.logits, dim=1)\n",
- "        test_acc = accuracy(preds, batch['labels'], task='binary')\n",
- "        self.log(\"test_accuracy\", test_acc)\n",
- "        return test_acc\n",
- "\n",
- "    def predict_step(self, batch, batch_idx, dataloader_idx=0):\n",
- "        output = self.forward(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])\n",
- "        return torch.argmax(output.logits, dim=1)\n",
- "\n",
- "    def configure_optimizers(self):\n",
- "        optimizer = AdamW(self.parameters(), lr=self.hparams.learning_rate, weight_decay=0.01)\n",
- "        scheduler = get_linear_schedule_with_warmup(\n",
- "            optimizer,\n",
- "            num_warmup_steps=self.hparams.n_warmup_steps,\n",
- "            num_training_steps=self.hparams.n_training_steps\n",
- "        )\n",
- "        return dict(optimizer=optimizer, lr_scheduler=dict(scheduler=scheduler, interval='step'))\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b3a3708d",
- "metadata": {},
- "outputs": [],
- "source": [
- "def train_sentiment_model(data_path='data/reviews_processed.csv', model_name='distilbert-base-uncased', n_epochs=5, sample_size: int = None):\n",
- "    \"\"\"\n",
- "    Main function to train the sentiment analysis model on the Amazon Reviews dataset.\n",
- "\n",
- "    Args:\n",
- "        data_path (str): Path to the processed data file.\n",
- "        model_name (str): Name of the transformer model to use.\n",
- "        n_epochs (int): Maximum number of epochs for training.\n",
- "        sample_size (int, optional): The number of reviews to use for training.\n",
- "            If None, the full dataset is used.\n",
- "    \"\"\"\n",
- "    # --- 1. Hyperparameters ---\n",
- "    BATCH_SIZE = 64\n",
- "    MAX_TOKEN_LEN = 256\n",
- "    LEARNING_RATE = 2e-5\n",
- "    N_CLASSES = 2 # Negative, Positive\n",
- "\n",
- "    # --- 2. Initialize DataModule ---\n",
- "    print(\"Initializing ReviewDataModule...\")\n",
- "    review_datamodule = ReviewDataModule(\n",
- "        data_path=data_path,\n",
- "        batch_size=BATCH_SIZE,\n",
- "        max_token_len=MAX_TOKEN_LEN,\n",
- "        model_name=model_name,\n",
- "        sample_size=sample_size # Pass the sample size to the datamodule\n",
- "    )\n",
- "    review_datamodule.setup()\n",
- "\n",
- "    n_training_steps = len(review_datamodule.train_dataloader()) * n_epochs\n",
- "    n_warmup_steps = int(n_training_steps * 0.1)\n",
- "\n",
- "    # --- 3. Initialize Model ---\n",
- "    print(\"Initializing SentimentClassifier model...\")\n",
- "    model = SentimentClassifier(\n",
- "        model_name=model_name,\n",
- "        n_classes=N_CLASSES,\n",
- "        learning_rate=LEARNING_RATE,\n",
- "        n_warmup_steps=n_warmup_steps,\n",
- "        n_training_steps=n_training_steps\n",
- "    )\n",
- "\n",
- "    # --- 4. Configure Training Callbacks ---\n",
- "    checkpoint_callback = ModelCheckpoint(\n",
- "        dirpath=\"checkpoints\",\n",
- "        filename=\"sentiment-binary-best-checkpoint\",\n",
- "        save_top_k=1,\n",
- "        verbose=True,\n",
- "        monitor=\"val_loss\",\n",
- "        mode=\"min\"\n",
- "    )\n",
- "    logger = TensorBoardLogger(\"lightning_logs\", name=\"sentiment-classifier-binary\")\n",
- "    early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)\n",
- "\n",
- "    # --- 5. Initialize Trainer ---\n",
- "    print(\"Initializing PyTorch Lightning Trainer...\")\n",
- "    trainer = pl.Trainer(\n",
- "        logger=logger,\n",
- "        callbacks=[checkpoint_callback, early_stopping_callback],\n",
- "        max_epochs=n_epochs,\n",
- "        accelerator='gpu' if torch.cuda.is_available() else 'cpu',\n",
- "        devices=1,\n",
- "    )\n",
- "\n",
- "    # --- 6. Start Training ---\n",
- "    print(f\"Starting training with {model_name} for up to {n_epochs} epochs...\")\n",
- "    trainer.fit(model, review_datamodule)\n",
- "\n",
- "    # --- 7. Evaluate on Test Set and Generate Confusion Matrix ---\n",
- "    print(\"\\nTraining complete. Evaluating on the test set...\")\n",
- "    trainer.test(model, datamodule=review_datamodule)\n",
- "\n",
- "    # Predict on the test dataloader directly; the DataModule defines no predict_dataloader\n",
- "    predictions = trainer.predict(model, dataloaders=review_datamodule.test_dataloader())\n",
- "    if predictions:\n",
- "        all_preds = torch.cat(predictions).cpu().numpy()\n",
- "        true_labels = review_datamodule.test_df.sentiment.to_numpy()\n",
- "        target_names = ['Negative', 'Positive'] # Updated labels\n",
- "\n",
- "        cm = confusion_matrix(true_labels, all_preds)\n",
- "        plt.figure(figsize=(8, 6))\n",
- "        sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu',\n",
- "                    xticklabels=target_names, yticklabels=target_names)\n",
- "        plt.title('Confusion Matrix for Sentiment Analysis')\n",
- "        plt.xlabel('Predicted Label')\n",
- "        plt.ylabel('True Label')\n",
- "        plt.show()\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3dae58e3",
- "metadata": {},
- "outputs": [],
- "source": [
- "# --- Train DistilBERT ---\n",
- "train_sentiment_model(data_path=data_path, sample_size=100000)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ddbc7315",
- "metadata": {},
- "source": [
- "## Define the models"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "85bd352b",
- "metadata": {},
- "outputs": [],
- "source": [
- "class ReviewSummarizer:\n",
- "    \"\"\"\n",
- "    A class to handle the summarization of product reviews using a pre-trained T5 model.\n",
- "    \"\"\"\n",
- "    def __init__(self, model_name='t5-small', force_cpu=False):\n",
- "        \"\"\"\n",
- "        Initializes the summarizer with a pre-trained T5 model and tokenizer.\n",
- "\n",
- "        Args:\n",
- "            model_name (str): The name of the pre-trained T5 model to use.\n",
- "            force_cpu (bool): If True, forces the model to run on the CPU.\n",
- "        \"\"\"\n",
- "        print(f\"Loading summarization model: {model_name}...\")\n",
- "        self.model_name = model_name\n",
- "        self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')\n",
- "\n",
- "        # Load the tokenizer and model from Hugging Face\n",
- "        self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)\n",
- "        self.model = T5ForConditionalGeneration.from_pretrained(self.model_name).to(self.device)\n",
- "        print(\"Summarization model loaded successfully.\")\n",
- "\n",
- "    def summarize(self, text: str, max_length: int = 50, min_length: int = 10) -> str:\n",
- "        \"\"\"\n",
- "        Generates a summary for a given text.\n",
- "\n",
- "        Args:\n",
- "            text (str): The review text to summarize.\n",
- "            max_length (int): The maximum length of the generated summary.\n",
- "            min_length (int): The minimum length of the generated summary.\n",
- "\n",
- "        Returns:\n",
- "            str: The generated summary.\n",
- "        \"\"\"\n",
- "        if not text or not isinstance(text, str):\n",
- "            return \"\"\n",
- "\n",
- "        # T5 models require a prefix for the task. For summarization, it's \"summarize: \"\n",
- "        preprocess_text = f\"summarize: {text.strip()}\"\n",
- "\n",
- "        # Tokenize the input text\n",
- "        tokenized_text = self.tokenizer.encode(preprocess_text, return_tensors=\"pt\").to(self.device)\n",
- "\n",
- "        # Generate the summary\n",
- "        summary_ids = self.model.generate(\n",
- "            tokenized_text,\n",
- "            max_length=max_length,\n",
- "            min_length=min_length,\n",
- "            length_penalty=2.0,\n",
- "            num_beams=4,\n",
- "            early_stopping=True\n",
- "        )\n",
- "\n",
- "        # Decode the summary and return it\n",
- "        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)\n",
- "        return summary\n",
- "\n",
- "class AspectAnalyzer:\n",
- "    \"\"\"\n",
- "    A class to handle Aspect-Based Sentiment Analysis (ABSA) using a pre-trained model.\n",
- "    \"\"\"\n",
- "    # Changed to a different, currently valid lightweight model for ABSA.\n",
- "    def __init__(self, model_name='yangheng/deberta-v3-base-absa-v1.1', force_cpu=False):\n",
- "        \"\"\"\n",
- "        Initializes the ABSA pipeline with a pre-trained model.\n",
- "\n",
- "        Args:\n",
- "            model_name (str): The name of the pre-trained ABSA model.\n",
- "            force_cpu (bool): If True, forces the model to run on the CPU.\n",
- "        \"\"\"\n",
- "        print(f\"Loading Aspect-Based Sentiment Analysis model: {model_name}...\")\n",
- "        self.model_name = model_name\n",
- "\n",
- "        if force_cpu:\n",
- "            self.device = -1 # Use -1 for CPU in pipeline\n",
- "            print(\"Forcing ABSA model to run on CPU.\")\n",
- "        else:\n",
- "            self.device = 0 if torch.cuda.is_available() else -1\n",
- "\n",
- "        print(f\"Using device: {self.device} (0 for GPU, -1 for CPU)\")\n",
- "\n",
- "        self.absa_pipeline = pipeline(\n",
- "            \"text-classification\",\n",
- "            model=self.model_name,\n",
- "            tokenizer=self.model_name,\n",
- "            device=self.device\n",
- "        )\n",
- "        print(\"ABSA model loaded successfully.\")\n",
- "\n",
- "    def analyze(self, text: str, aspects: list) -> dict:\n",
- "        \"\"\"\n",
- "        Analyzes the sentiment towards a list of aspects within a given text.\n",
- "        \"\"\"\n",
- "        if not text or not isinstance(text, str) or not aspects:\n",
- "            return {}\n",
- "\n",
- "        # The model expects the review and aspect separated by a special token.\n",
- "        # Note: Different ABSA models might expect different input formats.\n",
- "        # This format is common but may need adjustment for other models.\n",
- "        inputs = [f\"{text} [SEP] {aspect}\" for aspect in aspects]\n",
- "        results = self.absa_pipeline(inputs)\n",
- "\n",
- "        # Process results into a user-friendly dictionary\n",
- "        aspect_sentiments = {}\n",
- "        for aspect, result in zip(aspects, results):\n",
- "            aspect_sentiments[aspect] = {'sentiment': result['label'], 'score': result['score']}\n",
- "\n",
- "        return aspect_sentiments\n",
- "\n",
- "class FineTunedSentimentClassifier:\n",
- "    \"\"\"\n",
- "    This class handles loading the fine-tuned checkpoint and making predictions.\n",
- "    \"\"\"\n",
- "    def __init__(self, checkpoint_path, model_name='distilbert-base-uncased', force_cpu=False):\n",
- "        self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')\n",
- "        print(f\"Loading fine-tuned sentiment model from checkpoint: {checkpoint_path}...\")\n",
- "        print(f\"Using device: {self.device}\")\n",
- "\n",
- "        self.model = SentimentClassifier.load_from_checkpoint(checkpoint_path, map_location=self.device)\n",
- "        self.model.to(self.device)\n",
- "        self.model.eval() # Set model to evaluation mode\n",
- "\n",
- "        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
- "        self.labels = ['NEGATIVE', 'POSITIVE']\n",
- "        print(\"Fine-tuned sentiment model loaded successfully.\")\n",
- "\n",
- "    def classify(self, text: str) -> dict:\n",
- "        encoding = self.tokenizer.encode_plus(\n",
- "            text, add_special_tokens=True, max_length=128,\n",
- "            return_token_type_ids=False, padding=\"max_length\",\n",
- "            truncation=True, return_attention_mask=True, return_tensors='pt',\n",
- "        )\n",
- "        input_ids = encoding[\"input_ids\"].to(self.device)\n",
- "        attention_mask = encoding[\"attention_mask\"].to(self.device)\n",
- "        with torch.no_grad():\n",
- "            outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)\n",
- "            logits = outputs.logits\n",
- "        probabilities = torch.softmax(logits, dim=1)\n",
- "        prediction_idx = torch.argmax(probabilities, dim=1).item()\n",
- "        return {'label': self.labels[prediction_idx], 'score': probabilities[0][prediction_idx].item()}\n",
- "\n",
- "class AspectExtractor:\n",
- "    \"\"\"\n",
- "    This class uses a Part-of-Speech (POS) tagging model to first extract all\n",
- "    potential aspect terms (nouns) from a review text. It then filters these\n",
- "    nouns against a pre-defined dictionary of valid aspects for a given\n",
- "    product category to return only the relevant features.\n",
- "    \"\"\"\n",
- "    def __init__(self, model_name=\"vblagoje/bert-english-uncased-finetuned-pos\", force_cpu=False):\n",
- "        self.model_name = model_name\n",
- "        self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')\n",
- "        print(f\"Loading Part-of-Speech (POS) tagging model: {self.model_name}...\")\n",
- "        print(f\"Using device: {self.device}\")\n",
- "\n",
- "        self.pipeline = pipeline(\n",
- "            \"token-classification\",\n",
- "            model=self.model_name,\n",
- "            device=-1 if self.device == 'cpu' else 0,\n",
- "            aggregation_strategy=\"simple\"\n",
- "        )\n",
- "        print(\"POS tagging model loaded successfully.\")\n",
- "\n",
- "    def extract(self, text: str, aspect_dictionary: list) -> list:\n",
- "        \"\"\"\n",
- "        Extracts aspects from the given text that are present in the provided\n",
- "        aspect dictionary.\n",
- "\n",
- "        Args:\n",
- "            text (str): The review text to analyze.\n",
- "            aspect_dictionary (list): A list of valid, known aspects for the\n",
- "                product category.\n",
- "\n",
- "        Returns:\n",
- "            list: A list of aspects that were both found in the text and are\n",
- "                present in the aspect dictionary.\n",
- "        \"\"\"\n",
- "        if not text or not aspect_dictionary:\n",
- "            return []\n",
- "\n",
- "        # 1. Extract all nouns from the text using the POS model\n",
- "        model_outputs = self.pipeline(text)\n",
- "        noun_tags = {'NOUN', 'PROPN'}\n",
- "        extracted_nouns = {\n",
- "            output['word'].lower() for output in model_outputs\n",
- "            if output['entity_group'] in noun_tags\n",
- "        }\n",
- "\n",
- "        # 2. Filter the extracted nouns against the provided dictionary\n",
- "        # We find the intersection between the two sets.\n",
- "        valid_aspects = {aspect.lower() for aspect in aspect_dictionary}\n",
- "\n",
- "        final_aspects = list(extracted_nouns.intersection(valid_aspects))\n",
- "\n",
- "        return final_aspects\n",
- "    "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6fc21c8b",
- "metadata": {},
- "outputs": [],
- "source": [
- "# --- Configuration ---\n",
- "# --- IMPORTANT: UPDATE THIS PATH ---\n",
- "# You need to provide the path to the best checkpoint file that was saved\n",
- "# during the training of your sentiment model.\n",
- "SENTIMENT_CHECKPOINT_PATH = \"checkpoints/sentiment-binary-best-checkpoint.ckpt\"\n",
- "\n",
- "# --- Pre-defined Aspect Dictionaries for Different Product Categories ---\n",
- "ASPECT_DICTIONARIES = {\n",
- "    \"Phone\": ['camera', 'battery', 'battery life', 'screen', 'performance', 'price', 'design'],\n",
- "    \"Coffee Maker\": ['ease of use', 'design', 'noise level', 'coffee quality', 'brew time', 'cleaning'],\n",
- "    \"Book\": ['plot', 'characters', 'writing style', 'pacing', 'ending'],\n",
- "    \"Default\": ['quality', 'price', 'service', 'design', 'features'] # A fallback list\n",
- "}\n",
- "\n",
- "def main():\n",
- "    \"\"\"\n",
- "    Main function to run the command-line review analysis tool.\n",
- "    \"\"\"\n",
- "    # --- 1. Load All Models ---\n",
- "    print(\"--- Initializing all models ---\")\n",
- "    sentiment_classifier, summarizer, aspect_analyzer, aspect_extractor = None, None, None, None\n",
- "    try:\n",
- "        summarizer = ReviewSummarizer(force_cpu=True)\n",
- "        aspect_analyzer = AspectAnalyzer(force_cpu=True)\n",
- "        aspect_extractor = AspectExtractor(force_cpu=True)\n",
- "\n",
- "        if not os.path.exists(SENTIMENT_CHECKPOINT_PATH):\n",
- "            print(\"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\")\n",
- "            print(\"!!! WARNING: Sentiment checkpoint path not found or not set. !!!\")\n",
- "            print(f\"!!! Please update the 'SENTIMENT_CHECKPOINT_PATH' variable in main.py\")\n",
- "            print(\"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\")\n",
- "        else:\n",
- "            sentiment_classifier = FineTunedSentimentClassifier(\n",
- "                checkpoint_path=SENTIMENT_CHECKPOINT_PATH, force_cpu=True\n",
- "            )\n",
- "        print(\"\\n--- All models loaded successfully ---\\n\")\n",
- "    except Exception as e:\n",
- "        print(f\"An error occurred during model initialization: {e}\")\n",
- "        return\n",
- "\n",
- "    # --- 2. Interactive Loop ---\n",
- "    while True:\n",
- "        print(\"\\n==================================================\")\n",
929
- " print(\" Product Review Analysis Tool \")\n",
930
- " print(\"==================================================\")\n",
931
- "\n",
932
- " # Get user input\n",
933
- " review_text = input(\"Enter the product review text (or type 'quit' to exit):\\n> \")\n",
934
- " if review_text.lower() == 'quit':\n",
935
- " break\n",
936
- "\n",
937
- " print(\"\\nAvailable Product Categories:\")\n",
938
- " for i, category in enumerate(ASPECT_DICTIONARIES.keys(), 1):\n",
939
- " print(f\"{i}. {category}\")\n",
940
- "\n",
941
- " category_choice = input(f\"Select a product category (1-{len(ASPECT_DICTIONARIES)}):\\n> \")\n",
942
- " try:\n",
943
- " category_idx = int(category_choice) - 1\n",
944
- " product_category = list(ASPECT_DICTIONARIES.keys())[category_idx]\n",
945
- " except (ValueError, IndexError):\n",
946
- " print(\"Invalid choice. Using 'Default' category.\")\n",
947
- " product_category = \"Default\"\n",
948
- "\n",
949
- " # --- 3. Run Analysis ---\n",
950
- " print(\"\\n--- Analyzing Review... ---\")\n",
951
- "\n",
952
- " # a. Overall Sentiment\n",
953
- " sentiment_result = sentiment_classifier.classify(review_text)\n",
954
- "\n",
955
- " # b. Summary\n",
956
- " summary_result = summarizer.summarize(review_text)\n",
957
- "\n",
958
- " # c. Aspect Extraction and Analysis\n",
959
- " aspect_dictionary = ASPECT_DICTIONARIES.get(product_category)\n",
960
- " extracted_aspects = aspect_extractor.extract(review_text, aspect_dictionary)\n",
961
- " aspect_results = None\n",
962
- " if extracted_aspects:\n",
963
- " aspect_results = aspect_analyzer.analyze(review_text, extracted_aspects)\n",
964
- "\n",
965
- " # --- 4. Display Results ---\n",
966
- " print(\"\\n-------------------- ANALYSIS RESULTS --------------------\")\n",
967
- " print(f\"\\n[ Overall Sentiment ]\")\n",
968
- " print(f\" - Sentiment: {sentiment_result['label']} (Score: {sentiment_result['score']:.2f})\")\n",
969
- "\n",
970
- " print(f\"\\n[ Generated Summary ]\")\n",
971
- " print(f\" - {summary_result}\")\n",
972
- "\n",
973
- " print(f\"\\n[ Detected Aspect Sentiments ]\")\n",
974
- " if aspect_results:\n",
975
- " for aspect, result in aspect_results.items():\n",
976
- " print(f\" - {aspect.title()}: {result['sentiment']} (Score: {result['score']:.2f})\")\n",
977
- " else:\n",
978
- " print(\" - No relevant aspects from the dictionary were detected in the review.\")\n",
979
- " print(\"----------------------------------------------------------\")\n"
980
- ]
981
- },
982
- {
983
- "cell_type": "code",
984
- "execution_count": null,
985
- "id": "71257428",
986
- "metadata": {},
987
- "outputs": [],
988
- "source": [
989
- "# --- Run the workflow ---\n",
990
- "main()"
991
- ]
992
- }
993
- ],
994
- "metadata": {
995
- "language_info": {
996
- "name": "python"
997
- }
998
- },
999
- "nbformat": 4,
1000
- "nbformat_minor": 5
1001
- }
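The `AspectExtractor` deleted above reduces to a set intersection between the nouns a POS tagger finds and a per-category aspect dictionary. A minimal, dependency-free sketch of that step, with the POS pipeline stubbed out by hand-written outputs (a real run would use the token-classification pipeline from the notebook, not this stub):

```python
# Noun-intersection aspect extraction, as in the removed notebook cell.
# The POS pipeline is replaced by a hard-coded stub so the example runs
# without transformers; entity_group values mimic its output format.
def extract_aspects(model_outputs, aspect_dictionary):
    noun_tags = {'NOUN', 'PROPN'}
    # Collect lowercased nouns/proper nouns the tagger found
    extracted_nouns = {
        o['word'].lower() for o in model_outputs
        if o['entity_group'] in noun_tags
    }
    # Keep only nouns that are known aspects for this product category
    valid_aspects = {a.lower() for a in aspect_dictionary}
    return sorted(extracted_nouns & valid_aspects)

# Stubbed tagger output for "The camera is great but the battery died"
fake_outputs = [
    {'word': 'camera', 'entity_group': 'NOUN'},
    {'word': 'great', 'entity_group': 'ADJ'},
    {'word': 'battery', 'entity_group': 'NOUN'},
]
print(extract_aspects(fake_outputs, ['camera', 'battery', 'screen']))
# -> ['battery', 'camera']
```

Because the match is an exact string intersection, multi-word aspects like `'battery life'` from the dictionaries above only match if the tagger groups them into one entity.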
requirements.txt CHANGED
@@ -1,11 +1,16 @@
- torch==2.8.0
- transformers==4.56.1
- pytorch-lightning==2.5.5
- torchmetrics==1.8.2
- sentencepiece==0.2.1
  pandas==2.2.2
- scikit-learn==1.6.1
- gradio==5.44.1
- matplotlib==3.10.0
- seaborn==0.13.2
- wordcloud==1.9.4
+ langchain==0.3.27
+ langchain-community==0.3.31
+ gradio==5.49.1
+ llama_cpp_python==0.3.16
+ sentence-transformers==5.1.1
+ torch==2.8.0
+ transformers==4.57.1
+ faiss-cpu==1.12.0
+ ragas==0.3.7
+ openai==1.109.1
  pandas==2.2.2
+ datasets==4.0.0
+ numpy==2.0.2
+ accelerate==1.11.0
+ aiohttp==3.13.1
+ huggingface-hub==0.35.3
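With this many hard-pinned versions, drift between the lockfile and the installed environment is a common failure mode. A small, hypothetical helper (not part of the repo) that compares installed versions against the `==` pins using only the standard library:

```python
from importlib.metadata import version, PackageNotFoundError

def check_pins(requirements_text):
    """Return {name: (pinned, installed_or_None)} for every '==' pin
    whose installed version differs from the pin. Name normalization
    (dashes vs underscores) is ignored in this sketch."""
    mismatches = {}
    for line in requirements_text.strip().splitlines():
        if '==' not in line:
            continue  # skip comments, blank lines, non-pinned entries
        name, _, pinned = line.strip().partition('==')
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches

print(check_pins("this-pkg-does-not-exist==9.9"))
# -> {'this-pkg-does-not-exist': ('9.9', None)}
```

Running it against the full `requirements.txt` before launching the Space surfaces missing or mismatched packages early instead of as import errors at startup.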
scripts/app.py CHANGED
@@ -1,163 +1,280 @@
  import gradio as gr
- import os
  import torch
- from transformers import AutoTokenizer
- import pandas as pd
- import re
-
- # --- IMPORTANT ---
- # This script assumes you have a 'models.py' file in the same directory
- # containing the definitions for all model and inference classes.
- try:
-     from models import (
-         ReviewSummarizer,
-         AspectAnalyzer,
-         AspectExtractor,
-         FineTunedSentimentClassifier
-     )
- except ImportError:
-     print("CRITICAL ERROR: Make sure 'models.py' exists and contains the required classes.")
-     # Define dummy classes if imports fail, so Gradio can at least launch with an error message.
-     class ReviewSummarizer: pass
-     class AspectAnalyzer: pass
-     class AspectExtractor: pass
-     class FineTunedSentimentClassifier: pass
-
- # --- Configuration ---
- # This should be the relative path to your checkpoint file within the repository.
- SENTIMENT_CHECKPOINT_PATH = "checkpoints/sentiment-binary-best-checkpoint.ckpt"
-
-
- # --- Pre-defined Aspect Dictionaries for Different Product Categories ---
- ASPECT_DICTIONARIES = {
-     "Phone": ['camera', 'battery', 'battery life', 'screen', 'performance', 'price', 'design'],
-     "Coffee Maker": ['ease of use', 'design', 'noise level', 'coffee quality', 'brew time', 'cleaning'],
-     "Book": ['plot', 'characters', 'writing style', 'pacing', 'ending'],
-     "Default": ['quality', 'price', 'service', 'design', 'features'] # A fallback list
- }
-
-
- # --- Load All Models (Global Objects) ---
- print("--- Initializing all models for the Gradio App ---")
- sentiment_classifier, summarizer, aspect_analyzer, aspect_extractor = None, None, None, None
- try:
-     summarizer = ReviewSummarizer(force_cpu=True)
-     aspect_analyzer = AspectAnalyzer(force_cpu=True)
-     aspect_extractor = AspectExtractor(force_cpu=True)
-
-     if os.path.exists(SENTIMENT_CHECKPOINT_PATH):
-         sentiment_classifier = FineTunedSentimentClassifier(
-             checkpoint_path=SENTIMENT_CHECKPOINT_PATH, force_cpu=True
-         )
-     else:
-         print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
-         print("!!! WARNING: Sentiment checkpoint path not found. !!!")
-         print(f"!!! Path checked: '{SENTIMENT_CHECKPOINT_PATH}'")
-         print("!!! The fine-tuned sentiment model will NOT be loaded. !!!")
-         print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
-     print("\n--- All models loaded successfully ---\n")
- except Exception as e:
-     print(f"An error occurred during model initialization: {e}")
-
-
- # --- Define the Core Analysis Function ---
- def analyze_review(review_text, product_category):
-     if not review_text:
-         return {"ERROR": "Please enter a review."}, "", None
-
-     # --- a. Overall Sentiment Analysis ---
-     if sentiment_classifier:
-         sentiment_result = sentiment_classifier.classify(review_text)
-         sentiment_output = {
-             sentiment_result['label']: f"{sentiment_result['score']:.2f}"
-         }
-     else:
-         # **ROBUST ERROR HANDLING:** This prevents the app from crashing.
-         # It returns a dictionary that the Gradio Label component can display.
-         sentiment_output = {"Error: Model Not Loaded": 1.0}

-     # --- b. Review Summarization ---
-     if summarizer:
-         summary_output = summarizer.summarize(review_text)
-     else:
-         summary_output = "ERROR: Summarizer model not loaded."

-     # --- c. Dynamic Aspect Extraction & Analysis ---
-     aspect_df = None
-     if aspect_extractor and aspect_analyzer:
-         aspect_dictionary = ASPECT_DICTIONARIES.get(product_category, ASPECT_DICTIONARIES["Default"])
-         extracted_aspects = aspect_extractor.extract(review_text, aspect_dictionary=aspect_dictionary)

-         if extracted_aspects:
-             aspect_results = aspect_analyzer.analyze(review_text, extracted_aspects)
-             aspect_df = pd.DataFrame([
-                 {'Aspect': aspect, 'Sentiment': result['sentiment'], 'Score': f"{result['score']:.2f}"}
-                 for aspect, result in aspect_results.items()
-             ])

-     return sentiment_output, summary_output, aspect_df

- # --- Build the Gradio Interface ---
- with gr.Blocks(theme=gr.themes.Soft()) as demo:
-     gr.Markdown("# 🛍️ ReviewSense: Product Review Analysis Engine")
-     gr.Markdown(
-         "Enter a product review and select the product category. The tool will automatically "
-         "detect relevant features and provide an overall sentiment score, a summary, and a "
-         "breakdown of sentiment towards each feature."
-     )

-     with gr.Row():
-         with gr.Column(scale=2):
-             review_input = gr.Textbox(
-                 lines=10,
-                 label="Enter Product Review Here",
-                 placeholder="e.g., The camera is amazing, but the battery life is terrible..."
-             )
-             category_input = gr.Dropdown(
-                 choices=list(ASPECT_DICTIONARIES.keys()),
-                 label="Select Product Category",
-                 value="Phone"
-             )
-             analyze_button = gr.Button("Analyze Review", variant="primary")

-         with gr.Column(scale=1):
-             gr.Markdown("### Overall Sentiment")
-             sentiment_output = gr.Label()

-             gr.Markdown("### Generated Summary")
-             summary_output = gr.Textbox(lines=5, label="Summary", interactive=False)

-             gr.Markdown("### Detected Aspect Sentiments")
-             aspect_output = gr.DataFrame(headers=["Aspect", "Sentiment", "Score"], label="Aspects", interactive=False)

-     # Connect the button to the function
-     analyze_button.click(
-         fn=analyze_review,
-         inputs=[review_input, category_input],
-         outputs=[sentiment_output, summary_output, aspect_output]
      )

-     gr.Examples(
-         examples=[
-             [
-                 "The camera on this phone is incredible, the pictures are professional quality. However, the battery life is a total disaster, it barely lasts half a day with light use. The screen is bright and responsive, which I love.",
-                 "Phone"
-             ],
-             [
-                 "I am absolutely in love with this coffee maker! It's incredibly easy to use, brews a perfect cup every single time, and the design looks fantastic on my countertop. It's also surprisingly quiet.",
-                 "Coffee Maker"
-             ],
-             [
-                 "An amazing story with characters that felt so real. The plot had me hooked from the first page, though I felt the ending was a bit rushed.",
-                 "Book"
-             ]
-         ],
-         inputs=[review_input, category_input]
      )

- # --- Launch the App ---
  if __name__ == "__main__":
-     print("Launching Gradio App...")
-     demo.launch()

+ # app.py
+
  import gradio as gr
  import torch
+ from langchain_community.embeddings import HuggingFaceEmbeddings
+ from langchain_community.llms import LlamaCpp
+ from langchain.memory import ConversationBufferMemory
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
+ import os
+ import io
+
+ # Import the logic functions from src
+ import src.pipeline as pipeline
+
+ # --- Global Objects & Setup ---
+ # (Most setup code remains here as it's needed globally for the app)
+
+ print("--- Starting App Setup ---")
+ # 1. Download Model File
+ model_name = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
+ model_url = "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf"
+ if not os.path.exists(model_name):
+     print("Downloading model...")
+     os.system(f"wget {model_url}")
+ else:
+     print("Model already downloaded.")
+
+ # 2. Prepare Default Sample Data & Example Batch
+ print("Loading default reviews...")
+ default_reviews_text = """
+ This laptop is a beast! The M3 chip is incredibly fast, and the battery lasts a solid 10 hours of heavy use... (rest of laptop reviews) ...dongle life is real.
+ ---
+ I'm a student, and the battery life is a lifesaver... Highly recommend for college.
+ ---
+ The keyboard is a dream to type on... Bluetooth connection dropping...
+ ---
+ Video editing on this machine is flawless... price is very expensive...
+ ---
+ I bought this for travel... battery easily gets me through a 6-hour flight...
+ ---
+ Don't buy this if you need a lot of ports... only two USB-C ports...
+ """
+ default_reviews_list = [r.strip() for r in default_reviews_text.strip().split('---') if r.strip()]
+
+ example_batch = """
+ I'm absolutely blown away by the "NovaBlend Pro" blender!... (rest of blender example)... save your money.
+ """
+
+ # 3. Load Embedding Model, Text Splitter
+ print("Loading embedding model and text splitter...")
+ model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
+ embeddings = HuggingFaceEmbeddings(
+     model_name="sentence-transformers/all-MiniLM-L6-v2",
+     model_kwargs=model_kwargs
+ )
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=40)
+
+ # 4. Create Default Vector Store
+ print("Creating default FAISS vector store...")
+ default_vector_store = pipeline.create_vector_store_from_content(
+     "\n---\n".join(default_reviews_list), text_splitter, embeddings
+ )
+ if default_vector_store is None:
+     raise ValueError("Failed to create default vector store!")
+ print("Default vector store created successfully.")
+
+ # Global variable to hold the CURRENT vector store for the chatbot
+ # NOTE: Using a global like this works for simple Gradio apps but isn't
+ # robust for multiple users. Gradio state or session management is better
+ # for multi-user scenarios, but this keeps it simpler for now.
+ current_chatbot_vector_store = default_vector_store
+ current_context_source = "Default Laptop Reviews"
+
+ # 5. Load the LLM
+ print("Loading LLM (Mistral-7B GGUF)...")
+ llm = LlamaCpp(
+     model_path=model_name, n_gpu_layers=0, n_batch=512, n_ctx=4096,
+     f16_kv=True, temperature=0.0, max_tokens=512, verbose=False,
+     stop=["[/INST]", "User:", "Assistant:"]
+ )
+
+ # 6. Define All Prompts
+ print("Defining all prompts...")
+ # -- Phase 1 --
+ summary_template = """[INST] You are a helpful assistant... Reviews:\n{reviews} [/INST]\nConcise Summary:"""
+ summary_prompt = PromptTemplate(template=summary_template, input_variables=["reviews"])
+ aspect_template = """[INST] You are a helpful product analyst... Reviews:\n{reviews} [/INST]\nKey Pros and Cons:"""
+ aspect_prompt = PromptTemplate(template=aspect_template, input_variables=["reviews"])
+ sentiment_template = """[INST] You are a helpful sentiment analyst... Reviews:\n{reviews} [/INST]\nOverall Sentiment (Score 1-10):"""
+ sentiment_prompt = PromptTemplate(template=sentiment_template, input_variables=["reviews"])
+ # -- Phase 2 --
+ condense_question_template = """[INST] Given the following conversation... Follow Up Input: {question} [/INST]\nStandalone question:"""
+ CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(condense_question_template)
+ qa_system_prompt = """[INST]
+ You are a factual assistant that answers only using the provided product reviews.
+ If the reviews include partial or uncertain information, summarize what they say.
+ If there is no information at all about the user’s question, respond with:
+ "I'm sorry, there isn't enough information in the reviews to answer that."
+
+ Do not use or infer information about price, comparisons to other brands, or availability unless they are directly mentioned in the reviews.
+ Always include a short "Evidence:" sentence if you found relevant mentions.
+
+ Context:
+ {context}
+
+ User question:
+ {question}
+ [/INST]
+ """
+ qa_prompt = ChatPromptTemplate.from_messages([SystemMessagePromptTemplate.from_template(qa_system_prompt), HumanMessagePromptTemplate.from_template("Context:\n{context}\n\nQuestion:\n{question}\n\nHelpful Answer:")])
+ intent_template = """
+ [INST]
+ **CRITICAL INSTRUCTION:** Classify the user's query into ONLY ONE of two categories: "Product" or "Off-Topic".
+ Your response MUST be EXACTLY "Product" or EXACTLY "Off-Topic".
+
+ **EXAMPLES:**
+ Query: How is the battery life?
+ Classification: Product
+ Query: What are the complaints about the screen?
+ Classification: Product
+ Query: Does it come in blue?
+ Classification: Product
+ Query: What is the capital of France?
+ Classification: Off-Topic
+ Query: Hello there
+ Classification: Off-Topic
+ Query: Who are you?
+ Classification: Off-Topic
+
+ **NOW CLASSIFY THIS QUERY:**
+ Query: {query}
+ [/INST]
+ Classification:"""
+ intent_prompt = PromptTemplate(template=intent_template, input_variables=["query"])
+
+ # 7. Global Memory Object
+ chat_memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True, output_key='answer')
+
+ print("--- App Setup Complete ---")
+
+
+ # --- Gradio Helper Functions (Wrappers around pipeline logic) ---
+
+ def analyze_reviews_gradio_wrapper(review_text, review_file):
+     """Gradio wrapper for Phase 1 analysis."""
+     content = ""
+     if review_file is not None:
+         try:
+             if hasattr(review_file, 'name'): file_path = review_file.name; f=open(file_path, 'rb'); byte_content = f.read(); f.close()
+             else: byte_content = review_file
+             try: content = byte_content.decode('utf-8')
+             except UnicodeDecodeError: content = byte_content.decode('latin-1')
+         except Exception as e: return f"Error reading file: {e}", "", ""
+         if not content: return "Error: File empty", "", ""
+     elif review_text:
+         content = review_text
+     else:
+         return "Please paste reviews or upload a file.", "", ""
+
+     # Call the core logic function
+     return pipeline.analyze_reviews_logic(
+         content, llm, summary_prompt, aspect_prompt, sentiment_prompt
      )

+ def update_chatbot_context_gradio_wrapper(chatbot_file_upload):
+     """Gradio wrapper to update chatbot context."""
+     global current_chatbot_vector_store, current_context_source # Modify globals
+
+     if chatbot_file_upload is None:
+         return f"No file uploaded. Chatbot context remains: **{current_context_source}**."
+
+     print("Processing chatbot context file via Gradio...")
+     content = ""
+     file_name = "Uploaded File"
+     try:
+         if hasattr(chatbot_file_upload, 'name'):
+             file_path = chatbot_file_upload.name
+             file_name = os.path.basename(file_path)
+             with open(file_path, 'rb') as f: byte_content = f.read()
+         else: byte_content = chatbot_file_upload
+         try: content = byte_content.decode('utf-8')
+         except UnicodeDecodeError: content = byte_content.decode('latin-1')
+     except Exception as e: return f"Error reading file: {e}. Context not updated."
+     if not content: return "File empty. Context not updated."
+
+     # Call the core logic function to create the store
+     new_vector_store = pipeline.create_vector_store_from_content(content, text_splitter, embeddings)
+
+     if new_vector_store:
+         current_chatbot_vector_store = new_vector_store # Update global store
+         current_context_source = f"File: {file_name}"
+         status_message = f"Chatbot context updated using **{file_name}**."
+         print(status_message)
+         return status_message
+     else:
+         # If store creation failed, keep the old one
+         status_message = f"Error creating context from {file_name}. Chatbot context remains: **{current_context_source}**."
+         print(status_message)
+         return status_message
+
+
+ def chat_responder_gradio_wrapper(message, chat_history):
+     """Gradio wrapper for the chatbot response logic."""
+     # Pass necessary global objects to the core logic function
+     response = pipeline.get_chatbot_response(
+         message=message,
+         chat_memory=chat_memory,
+         vector_store=current_chatbot_vector_store, # Use the current global store
+         llm=llm,
+         intent_prompt=intent_prompt,
+         condense_prompt=CONDENSE_QUESTION_PROMPT,
+         qa_prompt=qa_prompt
      )
+     return response
+
+ def clear_chat_memory_gradio_wrapper():
+     """Gradio wrapper to clear memory."""
+     print("Clearing chat memory via Gradio button...")
+     chat_memory.clear()
+     print("Chat memory cleared.")
+     return [] # Return empty list to clear ChatInterface display
+
+ def reset_context_to_default_gradio_wrapper():
+     """Gradio wrapper to reset context to default."""
+     global current_chatbot_vector_store, current_context_source
+     print("Resetting context via Gradio button...")
+     current_chatbot_vector_store = default_vector_store
+     current_context_source = "Default Laptop Reviews"
+     status_msg = f"Chatbot context reset to **{current_context_source}**."
+     print(status_msg)
+     return status_msg
+
+ # --- Gradio UI Definition ---
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
+     gr.Markdown("# 🤖 Product Review Intelligence Center")
+     gr.Markdown("Analyze product reviews using Mistral-7B (Tab 1) or chat about reviews with customizable context (Tab 2).")
+
+     with gr.Tabs():
+         # --- TAB 1: BATCH ANALYZER ---
+         with gr.TabItem("Batch Analyzer"):
+             gr.Markdown("Paste reviews OR upload a file (.txt, .csv) to analyze them.")
+             gr.Markdown("**Note:** This analysis does *not* affect the chatbot's context in Tab 2.")
+             with gr.Row():
+                 with gr.Column(scale=2):
+                     review_input_text_tab1 = gr.Textbox(lines=15, placeholder="Paste reviews here...", label="Reviews Text Input")
+                     review_input_file_tab1 = gr.File(label="Upload Reviews File (.txt, .csv)", file_types=[".txt", ".csv"])
+                 with gr.Column(scale=1):
+                     summary_output_tab1 = gr.Textbox(label="Overall Summary", lines=5, interactive=False)
+                     aspect_output_tab1 = gr.Textbox(label="Key Aspects (Pros/Cons)", lines=5, interactive=False)
+                     sentiment_output_tab1 = gr.Textbox(label="Sentiment Analysis", lines=5, interactive=False)
+             analyze_button_tab1 = gr.Button("Analyze Reviews")
+             gr.Examples(examples=[[example_batch, None]], inputs=[review_input_text_tab1, review_input_file_tab1], outputs=[summary_output_tab1, aspect_output_tab1, sentiment_output_tab1], fn=analyze_reviews_gradio_wrapper, cache_examples=False) # Use wrapper
+             analyze_button_tab1.click(fn=analyze_reviews_gradio_wrapper, inputs=[review_input_text_tab1, review_input_file_tab1], outputs=[summary_output_tab1, aspect_output_tab1, sentiment_output_tab1]) # Use wrapper
+
+         # --- TAB 2: CHAT ABOUT REVIEWS ---
+         with gr.TabItem("Ask a Question (Chatbot)"):
+             gr.Markdown("Ask specific questions about product reviews. Upload a file below to change the chatbot's knowledge base.")
+             chatbot_status_display = gr.Markdown(f"Chatbot is currently using: **{current_context_source}**")
+             with gr.Row():
+                 chatbot_context_file = gr.File(label="Upload Chatbot Context File (.txt, .csv)", file_types=[".txt", ".csv"], scale=3)
+                 update_context_button = gr.Button("Update Chatbot Context", scale=1)
+             chatbot_interface = gr.ChatInterface(
+                 fn=chat_responder_gradio_wrapper, # Use wrapper
+                 examples=["How is the battery life?", "What about the screen?", "What are the complaints about connectivity?", "What is the capital of France?"],
+                 title="Review Chatbot"
+             )
+             with gr.Row():
+                 reset_memory_button = gr.Button("🔄 Reset Chat Memory")
+                 reset_context_button = gr.Button("🔄 Reset Chatbot Context to Default")
+             # Link actions to wrapper functions
+             update_context_button.click(fn=update_chatbot_context_gradio_wrapper, inputs=[chatbot_context_file], outputs=[chatbot_status_display])
+             reset_memory_button.click(fn=clear_chat_memory_gradio_wrapper, inputs=None, outputs=[chatbot_interface])
+             reset_context_button.click(fn=reset_context_to_default_gradio_wrapper, inputs=None, outputs=[chatbot_status_display])

+ # --- Launch Command ---
  if __name__ == "__main__":
+     chat_memory.clear() # Clear memory each time app starts
+     demo.launch(debug=True)
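Before indexing, the new `app.py` splits `---`-delimited reviews and chunks them with `chunk_size=250`, `chunk_overlap=40`. A dependency-free sketch of those two preprocessing steps; `chunk_text` is a simplified fixed-window stand-in for LangChain's `RecursiveCharacterTextSplitter`, which actually splits recursively on separators rather than at fixed offsets:

```python
def split_reviews(raw_text):
    """Split a '---'-delimited blob into individual reviews, as app.py does."""
    return [r.strip() for r in raw_text.strip().split('---') if r.strip()]

def chunk_text(text, chunk_size=250, chunk_overlap=40):
    """Naive sliding-window chunker: each chunk overlaps the previous one
    by chunk_overlap characters, so context spanning a boundary survives."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap
    return chunks

print(split_reviews("Great battery.\n---\nBad ports."))
# -> ['Great battery.', 'Bad ports.']
```

Each resulting chunk is what gets embedded with all-MiniLM-L6-v2 and stored in FAISS, so the retriever returns 250-character windows of review text, not whole reviews.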
scripts/data_prepare.py DELETED
@@ -1,263 +0,0 @@
- import pytorch_lightning as pl
- from torch.utils.data import DataLoader, Dataset
- from transformers import AutoTokenizer
- import pandas as pd
- from sklearn.model_selection import train_test_split
- import torch
- import os
-
- def explore_and_preprocess_reviews(
-     train_path='data/train.csv',
-     test_path='data/test.csv',
-     output_dir='data'
- ):
-     """
-     Loads the Amazon Sentiment Analysis dataset (https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews)
-     (you need to extract the train/test splits from the zip file in the data folder),
-     performs basic EDA, and preprocesses it for model training.
-
-     Args:
-         train_path (str): Path to the training CSV file.
-         test_path (str): Path to the testing CSV file.
-         output_dir (str): Directory to save the processed file.
-     """
-     # --- 1. Load Data ---
-     # This dataset typically comes without headers. We'll assign them.
-     # Column 1: Sentiment (1 = Negative, 2 = Positive)
-     # Column 2: Title
-     # Column 3: Review Text
-     print(f"Loading data from '{train_path}' and '{test_path}'...")
-     try:
-         col_names = ['sentiment_orig', 'title', 'review']
-         train_df = pd.read_csv(train_path, header=None, names=col_names)
-         test_df = pd.read_csv(test_path, header=None, names=col_names)
-
-         # Combine for unified EDA and preprocessing
-         df = pd.concat([train_df, test_df], ignore_index=True)
-
-     except FileNotFoundError:
-         print(f"\nERROR: Make sure '{train_path}' and '{test_path}' are in the specified directory.")
-         print("This script is designed for the 'Amazon Reviews for Sentiment Analysis' dataset from Kaggle.")
-         return
-
-     df.dropna(inplace=True)
-
-     # --- 2. Preprocessing ---
-     print("\n--- Preprocessing Data for Sentiment Analysis ---")
-
-     # a) Create new sentiment labels (0 = Negative, 1 = Positive)
-     # This dataset is binary, not three-class like the previous one.
-     df['sentiment'] = df['sentiment_orig'].apply(lambda x: 0 if x == 1 else 1)
-
-     # b) Combine title and review body
-     df['full_text'] = df['title'].astype(str) + ". " + df['review'].astype(str)
-
-     # c) Select and rename columns
-     processed_df = df[['full_text', 'sentiment']].copy()
-
-     # --- 4. Save Processed Data ---
-     os.makedirs(output_dir, exist_ok=True)
-     output_path = os.path.join(output_dir, 'reviews_processed.csv')
-     processed_df.to_csv(output_path, index=False)
-     print(f"\nSaved {len(processed_df)} processed reviews to '{output_path}'")
-
- class ReviewDataset(Dataset):
-     """
-     Custom PyTorch Dataset for Amazon Reviews.
-
-     This class takes a pandas DataFrame of review data, a tokenizer, and a max
-     token length, and prepares it for use in a PyTorch model. It handles the
-     tokenization of the text and the formatting of the labels for each item.
-
-     Attributes:
-         tokenizer: The Hugging Face tokenizer to use for processing text.
-         data (pd.DataFrame): The DataFrame containing the review data.
-         max_token_len (int): The maximum sequence length for the tokenizer.
-     """
-     def __init__(self, data: pd.DataFrame, tokenizer, max_token_len: int):
-         """
-         Initializes the ReviewDataset.
-
-         Args:
-             data (pd.DataFrame): The input DataFrame containing 'full_text' and
-                 'sentiment' columns.
-             tokenizer: The pre-trained tokenizer instance.
-             max_token_len (int): The maximum length for tokenized sequences.
-         """
-         self.tokenizer = tokenizer
-         self.data = data
-         self.max_token_len = max_token_len
-
-     def __len__(self):
-         """
-         Returns the total number of samples in the dataset.
-         """
-         return len(self.data)
-
-     def __getitem__(self, index: int):
-         """
-         Retrieves one sample from the dataset at the specified index.
-
-         This method handles the tokenization of a single review text, including
-         padding and truncation, and formats the output into a dictionary of
-         tensors ready for the model.
-
-         Args:
-             index (int): The index of the data sample to retrieve.
-
-         Returns:
-             dict: A dictionary containing the tokenized inputs and the label,
-                 with the following keys:
-                 - 'input_ids': The token IDs of the review text.
-                 - 'attention_mask': The attention mask for the review text.
-                 - 'labels': The sentiment label as a tensor.
-         """
-         data_row = self.data.iloc[index]
-         text = str(data_row.full_text)
-         labels = data_row.sentiment
-
-         encoding = self.tokenizer.encode_plus(
-             text,
-             add_special_tokens=True,
-             max_length=self.max_token_len,
-             return_token_type_ids=False,
-             padding="max_length",
-             truncation=True,
-             return_attention_mask=True,
-             return_tensors='pt',
-         )
-
-         return dict(
-             input_ids=encoding["input_ids"].flatten(),
-             attention_mask=encoding["attention_mask"].flatten(),
-             labels=torch.tensor(labels, dtype=torch.long)
-         )
-
- class ReviewDataModule(pl.LightningDataModule):
-     """
-     PyTorch Lightning DataModule to handle the Amazon Reviews dataset.
-
-     This class encapsulates all the steps needed to process the data:
-     loading, splitting, and creating PyTorch DataLoaders for training,
-     validation, and testing. It allows for using a smaller random sample of the
143
- full dataset for faster experimentation.
144
-
145
- Attributes:
146
- data_path (str): Path to the processed CSV file.
147
- batch_size (int): The size of each data batch.
148
- max_token_len (int): The maximum sequence length for the tokenizer.
149
- tokenizer: The Hugging Face tokenizer instance.
150
- num_workers (int): The number of CPU cores to use for data loading.
151
- sample_size (int, optional): The number of samples to use. If None,
152
- the full dataset is used.
153
- """
154
- def __init__(self, data_path: str, batch_size: int = 16, max_token_len: int = 256, model_name='distilbert-base-uncased', num_workers: int = 0, sample_size: int = None):
155
- """
156
- Initializes the ReviewDataModule.
157
-
158
- Args:
159
- data_path (str): The path to the processed CSV data file.
160
- batch_size (int): The number of samples per batch.
161
- max_token_len (int): Maximum length of tokenized sequences.
162
- model_name (str): The name of the pre-trained model to use for the tokenizer.
163
- num_workers (int): Number of subprocesses to use for data loading.
164
- sample_size (int, optional): If specified, a random sample of this
165
- size will be used from the dataset.
166
- Defaults to None, which uses the full dataset.
167
- """
168
- super().__init__()
169
- self.data_path = data_path
170
- self.batch_size = batch_size
171
- self.max_token_len = max_token_len
172
- self.tokenizer = AutoTokenizer.from_pretrained(model_name)
173
- self.num_workers = num_workers
174
- self.sample_size = sample_size
175
- self.train_df = None
176
- self.val_df = None
177
- self.test_df = None
178
-
179
- def setup(self, stage=None):
180
- """
181
- Loads and splits the data for training, validation, and testing.
182
-
183
- This method is called by PyTorch Lightning. It reads the CSV, handles
184
- missing values, optionally takes a random sample, and performs a
185
- stratified train-validation-test split. The indices of the resulting
186
- DataFrames are reset to prevent potential KeyErrors during data loading.
187
- """
188
- df = pd.read_csv(self.data_path)
189
- df.dropna(inplace=True)
190
-
191
- # If a sample size is provided, sample the dataframe
192
- if self.sample_size:
193
- print(f"Using a sample of {self.sample_size} reviews.")
194
- df = df.sample(n=self.sample_size, random_state=42)
195
-
196
- # Stratified split to maintain label distribution
197
- train_val_df, self.test_df = train_test_split(df, test_size=0.1, random_state=42, stratify=df.sentiment)
198
- self.train_df, self.val_df = train_test_split(train_val_df, test_size=0.1, random_state=42, stratify=train_val_df.sentiment)
199
-
200
- # Reset indices to prevent KeyErrors
201
- self.train_df = self.train_df.reset_index(drop=True)
202
- self.val_df = self.val_df.reset_index(drop=True)
203
- self.test_df = self.test_df.reset_index(drop=True)
204
-
205
- print(f"Size of training set: {len(self.train_df)}")
206
- print(f"Size of validation set: {len(self.val_df)}")
207
- print(f"Size of test set: {len(self.test_df)}")
208
-
209
- def train_dataloader(self):
210
- """Returns the DataLoader for the training set."""
211
- return DataLoader(
212
- ReviewDataset(self.train_df, self.tokenizer, self.max_token_len),
213
- batch_size=self.batch_size,
214
- shuffle=True,
215
- num_workers=self.num_workers
216
- )
217
-
218
- def val_dataloader(self):
219
- """Returns the DataLoader for the validation set."""
220
- return DataLoader(
221
- ReviewDataset(self.val_df, self.tokenizer, self.max_token_len),
222
- batch_size=self.batch_size,
223
- num_workers=self.num_workers
224
- )
225
-
226
- def test_dataloader(self):
227
- """Returns the DataLoader for the test set."""
228
- return DataLoader(
229
- ReviewDataset(self.test_df, self.tokenizer, self.max_token_len),
230
- batch_size=self.batch_size,
231
- num_workers=self.num_workers
232
- )
233
-
234
- if __name__ == "__main__":
235
-
236
- # --- Step 1: Preprocess the Reviews Dataset ---
237
- print("\n--- Preprocessing started ---")
238
- explore_and_preprocess_reviews()
239
- print("\n--- Preprocessing finished ---")
240
- # --- Configuration ---
241
- data_path = "data/reviews_processed.csv"
242
- BATCH_SIZE = 64
243
- MAX_TOKEN_LEN = 256
244
-
245
- print("Initializing ReviewDataModule...")
246
- review_datamodule = ReviewDataModule(
247
- data_path=data_path,
248
- batch_size=BATCH_SIZE,
249
- max_token_len=MAX_TOKEN_LEN,
250
- model_name='distilbert-base-uncased',
251
- sample_size=100000 # Pass the sample size to the datamodule
252
- )
253
- review_datamodule.setup()
254
-
255
- # Fetch one batch from the training dataloader to inspect its contents
256
- print("\n--- Fetching one batch from the training dataloader ---")
257
- train_batch = next(iter(review_datamodule.train_dataloader()))
258
-
259
- print("\n--- Example Batch ---")
260
- print(f"Input IDs shape: {train_batch['input_ids'].shape}")
261
- print(f"Attention Mask shape: {train_batch['attention_mask'].shape}")
262
- print(f"Labels: {train_batch['labels']}")
263
- print(f"Labels shape: {train_batch['labels'].shape}")
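The two-stage stratified split inside `ReviewDataModule.setup()` can be illustrated in isolation. A minimal sketch with a toy DataFrame (column names follow the processed CSV, `full_text`/`sentiment`; sizes here are made up for the example):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mirroring the processed reviews: text plus a binary sentiment label.
df = pd.DataFrame({
    "full_text": [f"review {i}" for i in range(100)],
    "sentiment": [i % 2 for i in range(100)],
})

# 10% held out for test, then 10% of the remainder for validation,
# stratifying on the label so every split keeps the 50/50 class balance.
train_val, test = train_test_split(df, test_size=0.1, random_state=42, stratify=df.sentiment)
train, val = train_test_split(train_val, test_size=0.1, random_state=42, stratify=train_val.sentiment)

print(len(train), len(val), len(test))  # 81 9 10
```

Resetting the indices afterwards (as `setup()` does with `reset_index(drop=True)`) matters because `Dataset.__getitem__` uses `iloc` positions, and the split leaves the original row labels scattered.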
scripts/main.py CHANGED
@@ -1,109 +1,212 @@
1
- import os
2
- import torch
3
- import pandas as pd
4
 
5
- try:
6
- from data_prepare import ReviewDataset, ReviewDataModule
7
- from models import SentimentClassifier, ReviewSummarizer, AspectAnalyzer, FineTunedSentimentClassifier, AspectExtractor
8
- except ImportError:
9
- print("CRITICAL ERROR: Make sure 'data_prepare.py' and 'models.py' are in the same directory.")
10
  exit()
11
 
12
- # --- Configuration ---
13
- # --- IMPORTANT: UPDATE THIS PATH ---
14
- # You need to provide the path to the best checkpoint file that was saved
15
- # during the training of your sentiment model.
16
- SENTIMENT_CHECKPOINT_PATH = "checkpoints/sentiment-binary-best-checkpoint.ckpt"
17
-
18
- # --- Pre-defined Aspect Dictionaries for Different Product Categories ---
19
- ASPECT_DICTIONARIES = {
20
- "Phone": ['camera', 'battery', 'battery life', 'screen', 'performance', 'price', 'design'],
21
- "Coffee Maker": ['ease of use', 'design', 'noise level', 'coffee quality', 'brew time', 'cleaning'],
22
- "Book": ['plot', 'characters', 'writing style', 'pacing', 'ending'],
23
- "Default": ['quality', 'price', 'service', 'design', 'features'] # A fallback list
24
- }
25
-
26
- def main():
27
- """
28
- Main function to run the command-line review analysis tool.
29
- """
30
- # --- 1. Load All Models ---
31
- print("--- Initializing all models ---")
32
- sentiment_classifier, summarizer, aspect_analyzer, aspect_extractor = None, None, None, None
33
- try:
34
- summarizer = ReviewSummarizer(force_cpu=True)
35
- aspect_analyzer = AspectAnalyzer(force_cpu=True)
36
- aspect_extractor = AspectExtractor(force_cpu=True)
37
-
38
- if not os.path.exists(SENTIMENT_CHECKPOINT_PATH):
39
- print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
40
- print("!!! WARNING: Sentiment checkpoint path not found or not set. !!!")
41
- print(f"!!! Please update the 'SENTIMENT_CHECKPOINT_PATH' variable in main.py")
42
- print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
43
- else:
44
- sentiment_classifier = FineTunedSentimentClassifier(
45
- checkpoint_path=SENTIMENT_CHECKPOINT_PATH, force_cpu=True
46
- )
47
- print("\n--- All models loaded successfully ---\n")
48
- except Exception as e:
49
- print(f"An error occurred during model initialization: {e}")
50
- return
51
-
52
- # --- 2. Interactive Loop ---
53
- while True:
54
- print("\n==================================================")
55
- print(" Product Review Analysis Tool ")
56
- print("==================================================")
57
-
58
- # Get user input
59
- review_text = input("Enter the product review text (or type 'quit' to exit):\n> ")
60
- if review_text.lower() == 'quit':
61
- break
62
-
63
- print("\nAvailable Product Categories:")
64
- for i, category in enumerate(ASPECT_DICTIONARIES.keys(), 1):
65
- print(f"{i}. {category}")
66
-
67
- category_choice = input(f"Select a product category (1-{len(ASPECT_DICTIONARIES)}):\n> ")
68
- try:
69
- category_idx = int(category_choice) - 1
70
- product_category = list(ASPECT_DICTIONARIES.keys())[category_idx]
71
- except (ValueError, IndexError):
72
- print("Invalid choice. Using 'Default' category.")
73
- product_category = "Default"
74
-
75
- # --- 3. Run Analysis ---
76
- print("\n--- Analyzing Review... ---")
77
-
78
- # a. Overall Sentiment
79
- sentiment_result = sentiment_classifier.classify(review_text)
80
-
81
- # b. Summary
82
- summary_result = summarizer.summarize(review_text)
83
-
84
- # c. Aspect Extraction and Analysis
85
- aspect_dictionary = ASPECT_DICTIONARIES.get(product_category)
86
- extracted_aspects = aspect_extractor.extract(review_text, aspect_dictionary)
87
- aspect_results = None
88
- if extracted_aspects:
89
- aspect_results = aspect_analyzer.analyze(review_text, extracted_aspects)
90
-
91
- # --- 4. Display Results ---
92
- print("\n-------------------- ANALYSIS RESULTS --------------------")
93
- print(f"\n[ Overall Sentiment ]")
94
- print(f" - Sentiment: {sentiment_result['label']} (Score: {sentiment_result['score']:.2f})")
95
-
96
- print(f"\n[ Generated Summary ]")
97
- print(f" - {summary_result}")
98
-
99
- print(f"\n[ Detected Aspect Sentiments ]")
100
- if aspect_results:
101
- for aspect, result in aspect_results.items():
102
- print(f" - {aspect.title()}: {result['sentiment']} (Score: {result['score']:.2f})")
103
- else:
104
- print(" - No relevant aspects from the dictionary were detected in the review.")
105
- print("----------------------------------------------------------")
106
-
107
-
108
  if __name__ == "__main__":
109
- main()
1
+ # main.py
 
 
2
 
3
+ import torch
4
+ from langchain_community.embeddings import HuggingFaceEmbeddings
5
+ from langchain_community.llms import LlamaCpp
6
+ from langchain.memory import ConversationBufferMemory
7
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
8
+ from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
9
+ import os
10
+ import argparse # For command-line arguments
11
+
12
+ # Import the logic functions from the local pipeline module (scripts/pipeline.py)
13
+ import pipeline
14
+
15
+ # --- Global Objects & Setup ---
16
+ # (Similar setup as app.py, load models, prompts etc.)
17
+ print("--- Starting Local Execution Setup ---")
18
+ # 1. Check/Define Model Path
19
+ model_name = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
20
+ if not os.path.exists(model_name):
21
+ print(f"ERROR: Model file '{model_name}' not found. Please download it first.")
22
  exit()
23
 
24
+ # 2. Prepare Default Sample Data (Optional, for context testing)
25
+ default_reviews_text = """...""" # Paste default laptop reviews
26
+ default_reviews_list = [r.strip() for r in default_reviews_text.strip().split('---') if r.strip()]
27
+
28
+ # 3. Load Embedding Model, Text Splitter
29
+ print("Loading embedding model and text splitter...")
30
+ model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
31
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs=model_kwargs)
32
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=40)
33
+
34
+ # 4. Create Default Vector Store
35
+ print("Creating default FAISS vector store...")
36
+ default_vector_store = pipeline.create_vector_store_from_content(
37
+ "\n---\n".join(default_reviews_list), text_splitter, embeddings
38
+ )
39
+ if default_vector_store is None: raise ValueError("Failed to create default vector store!")
40
+
41
+ # 5. Load the LLM
42
+ print("Loading LLM (Mistral-7B GGUF)...")
43
+ llm = LlamaCpp(
44
+ model_path=model_name, n_gpu_layers=0, n_batch=512, n_ctx=4096,
45
+ f16_kv=True, temperature=0.0, max_tokens=512, verbose=False,
46
+ stop=["[/INST]", "User:", "Assistant:"]
47
+ )
48
+
49
+ # 6. Define All Prompts
50
+ print("Defining all prompts...")
51
+ # -- Phase 1 --
52
+ summary_template = """[INST] ... Reviews:\n{reviews} [/INST]\nConcise Summary:"""
53
+ summary_prompt = PromptTemplate(template=summary_template, input_variables=["reviews"])
54
+ aspect_template = """[INST] ... Reviews:\n{reviews} [/INST]\nKey Pros and Cons:"""
55
+ aspect_prompt = PromptTemplate(template=aspect_template, input_variables=["reviews"])
56
+ sentiment_template = """[INST] ... Reviews:\n{reviews} [/INST]\nOverall Sentiment (Score 1-10):"""
57
+ sentiment_prompt = PromptTemplate(template=sentiment_template, input_variables=["reviews"])
58
+ # -- Phase 2 --
59
+ condense_question_template = """[INST] Given the following conversation... Follow Up Input: {question} [/INST]\nStandalone question:"""
60
+ CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(condense_question_template)
61
+ qa_system_prompt = """[INST]
62
+ You are a factual assistant that answers only using the provided product reviews.
63
+ If the reviews include partial or uncertain information, summarize what they say.
64
+ If there is no information at all about the user’s question, respond with:
65
+ "I'm sorry, there isn't enough information in the reviews to answer that."
66
+
67
+ Do not use or infer information about price, comparisons to other brands, or availability unless they are directly mentioned in the reviews.
68
+ Always include a short "Evidence:" sentence if you found relevant mentions.
69
+
70
+ Context:
71
+ {context}
72
+
73
+ User question:
74
+ {question}
75
+ [/INST]
76
+ """
77
+ qa_prompt = ChatPromptTemplate.from_messages([SystemMessagePromptTemplate.from_template(qa_system_prompt), HumanMessagePromptTemplate.from_template("Context:\n{context}\n\nQuestion:\n{question}\n\nHelpful Answer:")])
78
+ intent_template = """
79
+ [INST]
80
+ **CRITICAL INSTRUCTION:** Classify the user's query into ONLY ONE of two categories: "Product" or "Off-Topic".
81
+ Your response MUST be EXACTLY "Product" or EXACTLY "Off-Topic".
82
+
83
+ **EXAMPLES:**
84
+ Query: How is the battery life?
85
+ Classification: Product
86
+ Query: What are the complaints about the screen?
87
+ Classification: Product
88
+ Query: Does it come in blue?
89
+ Classification: Product
90
+ Query: What is the capital of France?
91
+ Classification: Off-Topic
92
+ Query: Hello there
93
+ Classification: Off-Topic
94
+ Query: Who are you?
95
+ Classification: Off-Topic
96
+
97
+ **NOW CLASSIFY THIS QUERY:**
98
+ Query: {query}
99
+ [/INST]
100
+ Classification:"""
101
+ intent_prompt = PromptTemplate(template=intent_template, input_variables=["query"])
102
+
103
+ # 7. Memory Object (Needed for chatbot logic)
104
+ chat_memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True, output_key='answer')
105
+
106
+ print("--- Local Setup Complete ---")
107
+
108
+
109
+ # --- Main Execution Logic ---
110
  if __name__ == "__main__":
111
+ parser = argparse.ArgumentParser(description="Run ReviewSense Analysis or Chat locally.")
112
+ parser.add_argument("--mode", choices=['analyze', 'chat'], required=True, help="Mode to run: 'analyze' reviews from a file, or 'chat' interactively.")
113
+ parser.add_argument("--input", type=str, help="Path to input .txt file for 'analyze' mode, or initial query for 'chat' mode.")
114
+ parser.add_argument("--context", type=str, help="Optional: Path to a .txt file to use as context for 'chat' mode (defaults to built-in laptop reviews).")
115
+
116
+ args = parser.parse_args()
117
+
118
+ # --- ANALYZE MODE ---
119
+ if args.mode == 'analyze':
120
+ if not args.input or not os.path.exists(args.input):
121
+ print(f"Error: Input file '{args.input}' not found for analyze mode.")
122
+ exit()
123
+ print(f"\n--- Running Analysis on: {args.input} ---")
124
+ try:
125
+ with open(args.input, 'r', encoding='utf-8') as f:
126
+ review_content = f.read()
127
+ except Exception as e:
128
+ print(f"Error reading input file: {e}")
129
+ exit()
130
+
131
+ summary, aspects, sentiment = pipeline.analyze_reviews_logic(
132
+ review_content, llm, summary_prompt, aspect_prompt, sentiment_prompt
133
+ )
134
+ print("\n--- Analysis Results ---")
135
+ print("\n[Summary]")
136
+ print(summary)
137
+ print("\n[Aspects]")
138
+ print(aspects)
139
+ print("\n[Sentiment]")
140
+ print(sentiment)
141
+
142
+ # --- CHAT MODE ---
143
+ elif args.mode == 'chat':
144
+ print("\n--- Starting Interactive Chat ---")
145
+ # Determine context
146
+ chat_vector_store = default_vector_store
147
+ context_name = "Default Laptop Reviews"
148
+ if args.context:
149
+ if not os.path.exists(args.context):
150
+ print(f"Warning: Context file '{args.context}' not found. Using default context.")
151
+ else:
152
+ print(f"Loading context from: {args.context}")
153
+ try:
154
+ with open(args.context, 'r', encoding='utf-8') as f:
155
+ context_content = f.read()
156
+ chat_vector_store = pipeline.create_vector_store_from_content(
157
+ context_content, text_splitter, embeddings
158
+ )
159
+ if chat_vector_store:
160
+ context_name = f"File: {os.path.basename(args.context)}"
161
+ else:
162
+ print("Failed to load context file. Using default context.")
163
+ chat_vector_store = default_vector_store
164
+ except Exception as e:
165
+ print(f"Error reading context file '{args.context}': {e}. Using default context.")
166
+ chat_vector_store = default_vector_store
167
+
168
+ print(f"Using context: {context_name}")
169
+ chat_memory.clear() # Start fresh chat session
170
+
171
+ # Handle initial query if provided
172
+ if args.input:
173
+ print("\nUser:", args.input)
174
+ response = pipeline.get_chatbot_response(
175
+ message=args.input,
176
+ chat_memory=chat_memory,
177
+ vector_store=chat_vector_store,
178
+ llm=llm,
179
+ intent_prompt=intent_prompt,
180
+ condense_prompt=CONDENSE_QUESTION_PROMPT,
181
+ qa_prompt=qa_prompt
182
+ )
183
+ print("\nAssistant:", response)
184
+
185
+ # Interactive loop
186
+ print("\nEnter your questions (type 'quit' or 'exit' to stop):")
187
+ while True:
188
+ try:
189
+ user_message = input("\nUser: ")
190
+ if user_message.lower() in ['quit', 'exit']:
191
+ break
192
+ if not user_message:
193
+ continue
194
+
195
+ response = pipeline.get_chatbot_response(
196
+ message=user_message,
197
+ chat_memory=chat_memory,
198
+ vector_store=chat_vector_store,
199
+ llm=llm,
200
+ intent_prompt=intent_prompt,
201
+ condense_prompt=CONDENSE_QUESTION_PROMPT,
202
+ qa_prompt=qa_prompt
203
+ )
204
+ print("\nAssistant:", response)
205
+
206
+ except EOFError: # Handle Ctrl+D
207
+ break
208
+ except KeyboardInterrupt: # Handle Ctrl+C
209
+ break
210
+ print("\n--- Chat session ended. ---")
211
+
212
+ print("\n--- Local Execution Finished ---")
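The new CLI above hinges on a small `argparse` surface. This self-contained sketch mirrors the `--mode`/`--input`/`--context` flags and simulates a chat invocation (the query string is illustrative, not from the repo):

```python
import argparse

parser = argparse.ArgumentParser(description="Run ReviewSense Analysis or Chat locally.")
parser.add_argument("--mode", choices=["analyze", "chat"], required=True,
                    help="'analyze' reviews from a file, or 'chat' interactively.")
parser.add_argument("--input", type=str,
                    help="Input .txt file (analyze mode) or initial query (chat mode).")
parser.add_argument("--context", type=str,
                    help="Optional .txt file used as chat context.")

# Simulate: python scripts/main.py --mode chat --input "How is the battery?"
args = parser.parse_args(["--mode", "chat", "--input", "How is the battery?"])
print(args.mode, args.input)  # chat How is the battery?
```

Because `--mode` is `required=True` with `choices`, an invalid or missing mode fails fast with a usage message instead of reaching the model-loading code.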
scripts/models.py DELETED
@@ -1,256 +0,0 @@
1
- import pytorch_lightning as pl
2
- from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup, AutoConfig
3
- from torch.optim import AdamW
4
- import torch
5
- from torchmetrics.functional import accuracy
6
- from transformers import T5ForConditionalGeneration, T5Tokenizer, AutoTokenizer, pipeline
7
-
8
- class SentimentClassifier(pl.LightningModule):
9
- """
10
- PyTorch Lightning module for the sentiment classification model.
11
- """
12
- def __init__(self, model_name='distilbert-base-uncased', n_classes=2, learning_rate=2e-5, n_warmup_steps=0, n_training_steps=0, dropout_prob=0.2): # Added dropout
13
- super().__init__()
14
- self.save_hyperparameters()
15
-
16
- # Configure dropout
17
- config = AutoConfig.from_pretrained(model_name)
18
- config.hidden_dropout_prob = dropout_prob
19
- config.attention_probs_dropout_prob = dropout_prob
20
- config.num_labels = n_classes
21
-
22
- self.model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
23
-
24
- def forward(self, input_ids, attention_mask, labels=None):
25
- return self.model(
26
- input_ids=input_ids,
27
- attention_mask=attention_mask,
28
- labels=labels
29
- )
30
-
31
- def training_step(self, batch, batch_idx):
32
- output = self.forward(**batch)
33
- self.log("train_loss", output.loss, prog_bar=True, logger=True)
34
- return output.loss
35
-
36
- def validation_step(self, batch, batch_idx):
37
- output = self.forward(**batch)
38
- preds = torch.argmax(output.logits, dim=1)
39
- val_acc = accuracy(preds, batch['labels'], task='binary')
40
- self.log("val_loss", output.loss, prog_bar=True, logger=True)
41
- self.log("val_accuracy", val_acc, prog_bar=True, logger=True)
42
- return output.loss
43
-
44
- def test_step(self, batch, batch_idx):
45
- output = self.forward(**batch)
46
- preds = torch.argmax(output.logits, dim=1)
47
- test_acc = accuracy(preds, batch['labels'], task='binary')
48
- self.log("test_accuracy", test_acc)
49
- return test_acc
50
-
51
- def predict_step(self, batch, batch_idx, dataloader_idx=0):
52
- output = self.forward(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
53
- return torch.argmax(output.logits, dim=1)
54
-
55
- def configure_optimizers(self):
56
- optimizer = AdamW(self.parameters(), lr=self.hparams.learning_rate, weight_decay=0.01)
57
- scheduler = get_linear_schedule_with_warmup(
58
- optimizer,
59
- num_warmup_steps=self.hparams.n_warmup_steps,
60
- num_training_steps=self.hparams.n_training_steps
61
- )
62
- return dict(optimizer=optimizer, lr_scheduler=dict(scheduler=scheduler, interval='step'))
63
-
64
- class ReviewSummarizer:
65
- """
66
- A class to handle the summarization of product reviews using a pre-trained T5 model.
67
- """
68
- def __init__(self, model_name='t5-small'):
69
- """
70
- Initializes the summarizer with a pre-trained T5 model and tokenizer.
71
-
72
- Args:
73
- model_name (str): The name of the pre-trained T5 model to use.
74
- """
75
- print(f"Loading summarization model: {model_name}...")
76
- self.model_name = model_name
77
- self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
78
-
79
- # Load the tokenizer and model from Hugging Face
80
- self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)
81
- self.model = T5ForConditionalGeneration.from_pretrained(self.model_name).to(self.device)
82
- print("Summarization model loaded successfully.")
83
-
84
- def summarize(self, text: str, max_length: int = 50, min_length: int = 10) -> str:
85
- """
86
- Generates a summary for a given text.
87
-
88
- Args:
89
- text (str): The review text to summarize.
90
- max_length (int): The maximum length of the generated summary.
91
- min_length (int): The minimum length of the generated summary.
92
-
93
- Returns:
94
- str: The generated summary.
95
- """
96
- if not text or not isinstance(text, str):
97
- return ""
98
-
99
- # T5 models require a prefix for the task. For summarization, it's "summarize: "
100
- preprocess_text = f"summarize: {text.strip()}"
101
-
102
- # Tokenize the input text
103
- tokenized_text = self.tokenizer.encode(preprocess_text, return_tensors="pt").to(self.device)
104
-
105
- # Generate the summary
106
- summary_ids = self.model.generate(
107
- tokenized_text,
108
- max_length=max_length,
109
- min_length=min_length,
110
- length_penalty=2.0,
111
- num_beams=4,
112
- early_stopping=True
113
- )
114
-
115
- # Decode the summary and return it
116
- summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
117
- return summary
118
-
119
- class AspectAnalyzer:
120
- """
121
- A class to handle Aspect-Based Sentiment Analysis (ABSA) using a pre-trained model.
122
- """
123
- # Changed to a different, currently valid lightweight model for ABSA.
124
- def __init__(self, model_name='yangheng/deberta-v3-base-absa-v1.1', force_cpu=False):
125
- """
126
- Initializes the ABSA pipeline with a pre-trained model.
127
-
128
- Args:
129
- model_name (str): The name of the pre-trained ABSA model.
130
- force_cpu (bool): If True, forces the model to run on the CPU.
131
- """
132
- print(f"Loading Aspect-Based Sentiment Analysis model: {model_name}...")
133
- self.model_name = model_name
134
-
135
- if force_cpu:
136
- self.device = -1 # Use -1 for CPU in pipeline
137
- print("Forcing ABSA model to run on CPU.")
138
- else:
139
- self.device = 0 if torch.cuda.is_available() else -1
140
-
141
- print(f"Using device: {self.device} (0 for GPU, -1 for CPU)")
142
-
143
- self.absa_pipeline = pipeline(
144
- "text-classification",
145
- model=self.model_name,
146
- tokenizer=self.model_name,
147
- device=self.device
148
- )
149
- print("ABSA model loaded successfully.")
150
-
151
- def analyze(self, text: str, aspects: list) -> dict:
152
- """
153
- Analyzes the sentiment towards a list of aspects within a given text.
154
- """
155
- if not text or not isinstance(text, str) or not aspects:
156
- return {}
157
-
158
- # The model expects the review and aspect separated by a special token.
159
- # Note: Different ABSA models might expect different input formats.
160
- # This format is common but may need adjustment for other models.
161
- inputs = [f"{text} [SEP] {aspect}" for aspect in aspects]
162
- results = self.absa_pipeline(inputs)
163
-
164
- # Process results into a user-friendly dictionary
165
- aspect_sentiments = {}
166
- for aspect, result in zip(aspects, results):
167
- aspect_sentiments[aspect] = {'sentiment': result['label'], 'score': result['score']}
168
-
169
- return aspect_sentiments
170
-
171
- class FineTunedSentimentClassifier:
172
- """
173
- This class handles loading the fine-tuned checkpoint and making predictions.
174
- """
175
- def __init__(self, checkpoint_path, model_name='distilbert-base-uncased', force_cpu=False):
176
- self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')
177
- print(f"Loading fine-tuned sentiment model from checkpoint: {checkpoint_path}...")
178
- print(f"Using device: {self.device}")
179
-
180
- self.model = SentimentClassifier.load_from_checkpoint(checkpoint_path, map_location=self.device)
181
- self.model.to(self.device)
182
- self.model.eval() # Set model to evaluation mode
183
-
184
- self.tokenizer = AutoTokenizer.from_pretrained(model_name)
185
- self.labels = ['NEGATIVE', 'POSITIVE']
186
- print("Fine-tuned sentiment model loaded successfully.")
187
-
188
- def classify(self, text: str) -> dict:
189
- encoding = self.tokenizer.encode_plus(
190
- text, add_special_tokens=True, max_length=128,
191
- return_token_type_ids=False, padding="max_length",
192
- truncation=True, return_attention_mask=True, return_tensors='pt',
193
- )
194
- input_ids = encoding["input_ids"].to(self.device)
195
- attention_mask = encoding["attention_mask"].to(self.device)
196
- with torch.no_grad():
197
- outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
198
- logits = outputs.logits
199
- probabilities = torch.softmax(logits, dim=1)
200
- prediction_idx = torch.argmax(probabilities, dim=1).item()
201
- return {'label': self.labels[prediction_idx], 'score': probabilities[0][prediction_idx].item()}
202
-
203
- class AspectExtractor:
204
- """
205
- This class uses a Part-of-Speech (POS) tagging model to first extract all
206
- potential aspect terms (nouns) from a review text. It then filters these
207
- nouns against a pre-defined dictionary of valid aspects for a given
208
- product category to return only the relevant features.
209
- """
210
- def __init__(self, model_name="vblagoje/bert-english-uncased-finetuned-pos", force_cpu=False):
211
- self.model_name = model_name
212
- self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')
213
- print(f"Loading Part-of-Speech (POS) tagging model: {self.model_name}...")
214
- print(f"Using device: {self.device}")
215
-
216
- self.pipeline = pipeline(
217
- "token-classification",
218
- model=self.model_name,
219
- device=-1 if self.device == 'cpu' else 0,
220
- aggregation_strategy="simple"
221
- )
222
- print("POS tagging model loaded successfully.")
223
-
224
- def extract(self, text: str, aspect_dictionary: list) -> list:
225
- """
226
- Extracts aspects from the given text that are present in the provided
227
- aspect dictionary.
228
-
229
- Args:
230
- text (str): The review text to analyze.
231
- aspect_dictionary (list): A list of valid, known aspects for the
232
- product category.
233
-
234
- Returns:
235
- list: A list of aspects that were both found in the text and are
236
- present in the aspect dictionary.
237
- """
238
- if not text or not aspect_dictionary:
239
- return []
240
-
241
- # 1. Extract all nouns from the text using the POS model
242
- model_outputs = self.pipeline(text)
243
- noun_tags = {'NOUN', 'PROPN'}
244
- extracted_nouns = {
245
- output['word'].lower() for output in model_outputs
246
- if output['entity_group'] in noun_tags
247
- }
248
-
249
- # 2. Filter the extracted nouns against the provided dictionary
250
- # We find the intersection between the two sets.
251
- valid_aspects = {aspect.lower() for aspect in aspect_dictionary}
252
-
253
- final_aspects = list(extracted_nouns.intersection(valid_aspects))
254
-
255
- return final_aspects
256
-
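The label-selection step in the removed `FineTunedSentimentClassifier.classify` (softmax over logits, then argmax) reduces to a few lines. A dependency-free sketch with made-up logits, assuming the same `['NEGATIVE', 'POSITIVE']` label order:

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]

def classify_from_logits(logits):
    # Softmax turns raw logits into probabilities that sum to 1
    # (subtracting the max first keeps exp() numerically stable).
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    # Argmax picks the most probable class; its probability is the score.
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[idx], "score": probs[idx]}

result = classify_from_logits([-1.2, 2.3])  # hypothetical logits
print(result["label"])  # POSITIVE
```

The torch version in `classify()` does exactly this with `torch.softmax(logits, dim=1)` and `torch.argmax`, batched and on the selected device.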
scripts/pipeline.py ADDED
```diff
@@ -0,0 +1,127 @@
+# src/pipeline.py
+
+import os
+import io
+from langchain_community.vectorstores import FAISS
+from langchain.chains import ConversationalRetrievalChain
+from langchain.prompts import PromptTemplate  # Ensure necessary LangChain imports are here if needed directly
+
+# --- Core Logic Functions ---
+
+def analyze_reviews_logic(review_text: str, llm, summary_prompt, aspect_prompt, sentiment_prompt):
+    """
+    Performs Phase 1 analysis (summary, aspects, sentiment) on the provided text.
+    """
+    print(f"Running batch analysis logic on {len(review_text)} chars...")
+    try:
+        summary_result = llm.invoke(summary_prompt.format(reviews=review_text)).strip()
+        print(" -> Summary generated.")
+        aspect_result = llm.invoke(aspect_prompt.format(reviews=review_text)).strip()
+        print(" -> Aspects extracted.")
+        sentiment_result = llm.invoke(sentiment_prompt.format(reviews=review_text)).strip()
+        print(" -> Sentiment analyzed.")
+        return summary_result, aspect_result, sentiment_result
+    except Exception as e:
+        print(f"ERROR during batch analysis logic: {e}")
+        error_msg = f"Error during analysis: {e}"
+        return error_msg, error_msg, error_msg
+
+def create_vector_store_from_content(content: str, text_splitter, embeddings):
+    """
+    Splits content and creates a new FAISS vector store.
+    Returns the vector store, or None if an error occurs.
+    """
+    print("Creating new vector store from content...")
+    if not content:
+        print("Error: No content provided to create vector store.")
+        return None
+
+    # Split content: prefer the explicit '---' delimiter, else blank lines
+    if "\n---\n" in content:
+        reviews_list = [r.strip() for r in content.strip().split('\n---\n') if r.strip()]
+    else:
+        reviews_list = [r.strip() for r in content.strip().split('\n\n') if r.strip()]
+        if len(reviews_list) <= 1:
+            reviews_list = [content.strip()]  # Single block case
+
+    if not reviews_list:
+        print("Error: Could not extract reviews from content.")
+        return None
+
+    review_chunks = text_splitter.create_documents(reviews_list)
+    if not review_chunks:
+        print("Error: Failed to create document chunks.")
+        return None
+
+    try:
+        vector_store = FAISS.from_documents(review_chunks, embeddings)
+        print("Vector store created successfully.")
+        return vector_store
+    except Exception as e:
+        print(f"Error creating FAISS index: {e}")
+        return None
+
+def parse_intent(llm_output: str) -> str:
+    """
+    Parses the LLM output to find 'Product' or 'Off-Topic'.
+    Defaults to 'Off-Topic' if neither is found or the output is unexpected.
+    Uses a case-insensitive 'in' check for robustness.
+    """
+    output_lower = llm_output.strip().lower()
+    if "product" in output_lower:
+        return "Product"
+    elif "off-topic" in output_lower:
+        return "Off-Topic"
+    else:
+        print(f" -> Unexpected classification: '{llm_output.strip()}'. Defaulting to Off-Topic.")
+        return "Off-Topic"
+
+def get_chatbot_response(message: str, chat_memory, vector_store, llm, intent_prompt, condense_prompt, qa_prompt):
+    """
+    Handles Phase 2: classifies intent and runs RAG if appropriate.
+    Returns the chatbot's response string.
+    """
+    print(f"\nProcessing chatbot query: {message}")
+
+    # --- 1. Classify Intent ---
+    formatted_intent_prompt = intent_prompt.format(query=message)
+    intent_result_raw = llm.invoke(formatted_intent_prompt)
+    print(f" DEBUG: Raw Intent Output: '{intent_result_raw.strip()}'")
+    intent = parse_intent(intent_result_raw)
+    print(f" -> Detected Intent: {intent}")
+
+    # --- 2. Route ---
+    if intent == "Product":
+        print(" -> Routing to RAG chain...")
+        if vector_store is None:
+            print(" ERROR: No vector store available for RAG.")
+            return "Sorry, I don't have any review context loaded to answer product questions."
+
+        retriever = vector_store.as_retriever(search_kwargs={"k": 4})
+
+        # Create the chain dynamically for each call
+        conv_qa_chain = ConversationalRetrievalChain.from_llm(
+            llm=llm,
+            retriever=retriever,
+            memory=chat_memory,
+            condense_question_prompt=condense_prompt,
+            combine_docs_chain_kwargs={"prompt": qa_prompt},
+            return_source_documents=True,  # Required for context list in result
+            verbose=False
+        )
+        try:
+            # Pass only the question - memory handles history internally
+            result = conv_qa_chain.invoke({"question": message})
+            answer = result['answer'].strip()
+            print(f" -> RAG Answer: {answer}")
+            return answer
+        except Exception as e:
+            print(f"ERROR during RAG chain execution: {e}")
+            # Optionally log traceback: import traceback; traceback.print_exc()
+            return "Sorry, I encountered an error trying to find an answer in the reviews."
+
+    else:  # Off-Topic
+        print(" -> Routing to canned response...")
+        answer = "I'm sorry, I can only answer questions about the product reviews for this item."
+        # Optional: save the off-topic turn to memory if desired
+        # chat_memory.save_context({"question": message}, {"answer": answer})
+        return answer
```
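The review-splitting heuristic in `create_vector_store_from_content` can be exercised in isolation: prefer an explicit `\n---\n` delimiter, fall back to blank-line splitting, and treat unsplittable text as a single block. A small sketch of that logic (the `split_reviews` helper is a hypothetical extraction for illustration, not a function in the repo):

```python
def split_reviews(content: str) -> list:
    """Mirror of the delimiter heuristic applied before chunking/embedding."""
    if "\n---\n" in content:
        return [r.strip() for r in content.strip().split("\n---\n") if r.strip()]
    reviews = [r.strip() for r in content.strip().split("\n\n") if r.strip()]
    # Single block case: no blank-line structure to split on
    return reviews if len(reviews) > 1 else [content.strip()]

print(split_reviews("Great phone.\n---\nBad battery."))  # ['Great phone.', 'Bad battery.']
print(split_reviews("Great phone.\n\nBad battery."))     # ['Great phone.', 'Bad battery.']
print(split_reviews("One single review."))               # ['One single review.']
```

Each resulting review then goes through the text splitter and into FAISS as one or more chunks, so the delimiter choice directly controls retrieval granularity.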
scripts/train_distilbet.py DELETED
```diff
@@ -1,101 +0,0 @@
-import pytorch_lightning as pl
-from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
-from pytorch_lightning.loggers import TensorBoardLogger
-import torch
-import seaborn as sns
-import matplotlib.pyplot as plt
-from sklearn.metrics import confusion_matrix
-from data_prepare import ReviewDataModule, ReviewDataset
-from models import SentimentClassifier
-
-def train_sentiment_model(data_path='data/reviews_processed.csv', model_name='distilbert-base-uncased', n_epochs=5, sample_size: int = None):
-    """
-    Main function to train the sentiment analysis model on the Amazon Reviews dataset.
-
-    Args:
-        data_path (str): Path to the processed data file.
-        model_name (str): Name of the transformer model to use.
-        n_epochs (int): Maximum number of epochs for training.
-        sample_size (int, optional): The number of reviews to use for training.
-            If None, the full dataset is used.
-    """
-    # --- 1. Hyperparameters ---
-    BATCH_SIZE = 64
-    MAX_TOKEN_LEN = 256
-    LEARNING_RATE = 2e-5
-    N_CLASSES = 2  # Negative, Positive
-
-    # --- 2. Initialize DataModule ---
-    print("Initializing ReviewDataModule...")
-    review_datamodule = ReviewDataModule(
-        data_path=data_path,
-        batch_size=BATCH_SIZE,
-        max_token_len=MAX_TOKEN_LEN,
-        model_name=model_name,
-        sample_size=sample_size  # Pass the sample size to the datamodule
-    )
-    review_datamodule.setup()
-
-    n_training_steps = len(review_datamodule.train_dataloader()) * n_epochs
-    n_warmup_steps = int(n_training_steps * 0.1)
-
-    # --- 3. Initialize Model ---
-    print("Initializing SentimentClassifier model...")
-    model = SentimentClassifier(
-        model_name=model_name,
-        n_classes=N_CLASSES,
-        learning_rate=LEARNING_RATE,
-        n_warmup_steps=n_warmup_steps,
-        n_training_steps=n_training_steps
-    )
-
-    # --- 4. Configure Training Callbacks ---
-    checkpoint_callback = ModelCheckpoint(
-        dirpath="checkpoints",
-        filename="sentiment-binary-best-checkpoint",
-        save_top_k=1,
-        verbose=True,
-        monitor="val_loss",
-        mode="min"
-    )
-    logger = TensorBoardLogger("lightning_logs", name="sentiment-classifier-binary")
-    early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)
-
-    # --- 5. Initialize Trainer ---
-    print("Initializing PyTorch Lightning Trainer...")
-    trainer = pl.Trainer(
-        logger=logger,
-        callbacks=[checkpoint_callback, early_stopping_callback],
-        max_epochs=n_epochs,
-        accelerator='gpu' if torch.cuda.is_available() else 'cpu',
-        devices=1,
-    )
-
-    # --- 6. Start Training ---
-    print(f"Starting training with {model_name} for up to {n_epochs} epochs...")
-    trainer.fit(model, review_datamodule)
-
-    # --- 7. Evaluate on Test Set and Generate Confusion Matrix ---
-    print("\nTraining complete. Evaluating on the test set...")
-    trainer.test(model, datamodule=review_datamodule)
-
-    predictions = trainer.predict(model, datamodule=review_datamodule)
-    if predictions:
-        all_preds = torch.cat(predictions).cpu().numpy()
-        true_labels = review_datamodule.test_df.sentiment.to_numpy()
-        target_names = ['Negative', 'Positive']  # Updated labels
-
-        cm = confusion_matrix(true_labels, all_preds)
-        plt.figure(figsize=(8, 6))
-        sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu',
-                    xticklabels=target_names, yticklabels=target_names)
-        plt.title('Confusion Matrix for Sentiment Analysis')
-        plt.xlabel('Predicted Label')
-        plt.ylabel('True Label')
-        plt.show()
-
-
-if __name__ == "__main__":
-    data_path = "data/reviews_processed.csv"
-    train_sentiment_model(data_path=data_path, sample_size=100000)
```
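The `EarlyStopping(monitor='val_loss', patience=2)` callback in the deleted trainer halts training once validation loss has gone two consecutive epochs without improving. A simplified re-implementation of that rule, just to make the semantics concrete (a sketch, not Lightning's actual code, which also supports `min_delta` and other modes):

```python
def stopped_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop,
    or None if it runs to completion."""
    best = float("inf")
    bad_epochs = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best = loss       # new best: reset the patience counter
            bad_epochs = 0
        else:
            bad_epochs += 1   # no improvement this epoch
            if bad_epochs >= patience:
                return i
    return None

print(stopped_epoch([0.60, 0.45, 0.46, 0.47, 0.30]))  # 3: two epochs without improvement
print(stopped_epoch([0.60, 0.45, 0.40]))              # None: still improving
```

Note the 0.30 epoch is never reached; early stopping trades a possible late improvement for shorter training runs, which is why the best checkpoint is saved separately via `ModelCheckpoint`.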
scripts/train_naive_bayes.py DELETED
```diff
@@ -1,118 +0,0 @@
-import pandas as pd
-import numpy as np
-from sklearn.model_selection import train_test_split, ParameterGrid, StratifiedKFold
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.naive_bayes import MultinomialNB
-from sklearn.pipeline import Pipeline
-from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
-import seaborn as sns
-import matplotlib.pyplot as plt
-from tqdm.notebook import tqdm
-import os
-
-def train_baseline_sentiment_model(data_path='data/reviews_processed.csv', grid_search=True, nb__alpha=0.1, tfidf__max_df=0.75, tfidf__ngram_range=(1, 2), sample_size: int = 50000):
-    """
-    Trains and evaluates a Multinomial Naive Bayes model for sentiment analysis.
-    Can optionally perform a grid search.
-
-    Args:
-        data_path (str): Path to the processed reviews CSV file.
-        grid_search (bool): If True, performs a grid search.
-        nb__alpha (float): Alpha for MultinomialNB.
-        tfidf__max_df (float): max_df for TfidfVectorizer.
-        tfidf__ngram_range (tuple): ngram_range for TfidfVectorizer.
-        sample_size (int, optional): Number of reviews to use. If None, uses all.
-    """
-    # --- 1. Load Data ---
-    print(f"Loading data from '{data_path}'...")
-    if not os.path.exists(data_path):
-        print(f"\nERROR: '{data_path}' not found. Please run the EDA script first!")
-        return
-
-    df = pd.read_csv(data_path)
-    df.dropna(inplace=True)
-
-    # --- 2. Sample Data ---
-    if sample_size:
-        print(f"Using a sample of {sample_size} reviews for training the baseline model.")
-        df = df.sample(n=sample_size, random_state=42)
-
-    # --- 3. Train-Test Split ---
-    print("Splitting data into training and testing sets...")
-    X_train, X_test, y_train, y_test = train_test_split(
-        df['full_text'],
-        df['sentiment'],
-        test_size=0.2,
-        random_state=42,
-        stratify=df['sentiment']
-    )
-
-    # --- 4. Create a Pipeline ---
-    pipeline = Pipeline([
-        ('tfidf', TfidfVectorizer(stop_words='english')),
-        ('nb', MultinomialNB()),
-    ])
-
-    best_params = None
-
-    if grid_search:
-        # --- 5a. Perform Grid Search ---
-        print("Performing Grid Search to find the best hyperparameters...")
-        parameters = {
-            'tfidf__ngram_range': [(1, 1), (1, 2)],
-            'tfidf__max_df': [0.5, 0.75, 1.0],
-            'nb__alpha': [0.1, 0.5, 1.0],
-        }
-        param_grid = list(ParameterGrid(parameters))
-        best_score = -1
-
-        for params in tqdm(param_grid, desc="Grid Search Progress"):
-            pipeline.set_params(**params)
-            pipeline.fit(X_train, y_train)
-            score = pipeline.score(X_test, y_test)
-            if score > best_score:
-                best_score = score
-                best_params = params
-
-        print(f"\nBest score on test set: {best_score:.4f}")
-        print("Best parameters found:")
-        print(best_params)
-
-    else:
-        # --- 5b. Use provided hyperparameters ---
-        print("Skipping grid search and using provided hyperparameters...")
-        best_params = {
-            'nb__alpha': nb__alpha,
-            'tfidf__max_df': tfidf__max_df,
-            'tfidf__ngram_range': tfidf__ngram_range
-        }
-
-    # --- 6. Train the Final Model ---
-    print("\nTraining final model...")
-    best_model = pipeline.set_params(**best_params)
-    best_model.fit(X_train, y_train)
-    print("Model training complete.")
-
-    # --- 7. Evaluate the Best Model ---
-    print("\n--- Model Evaluation ---")
-    y_pred = best_model.predict(X_test)
-
-    accuracy = accuracy_score(y_test, y_pred)
-    target_names = ['Negative', 'Positive']
-
-    print(f"Accuracy: {accuracy:.4f}")
-    print("\nClassification Report:")
-    print(classification_report(y_test, y_pred, target_names=target_names))
-
-    print("Confusion Matrix:")
-    cm = confusion_matrix(y_test, y_pred)
-    plt.figure(figsize=(8, 6))
-    sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
-                xticklabels=target_names, yticklabels=target_names)
-    plt.title('Confusion Matrix for Naive Bayes on Amazon Reviews')
-    plt.xlabel('Predicted Label')
-    plt.ylabel('True Label')
-    plt.show()
-
-if __name__ == "__main__":
-    train_baseline_sentiment_model(sample_size=150000, grid_search=False)
```
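The deleted baseline implements grid search by hand: enumerate every parameter combination, fit, score on the held-out set, keep the best. The same enumerate-and-argmax pattern can be sketched with `itertools.product`, using a toy scoring function in place of `pipeline.fit`/`pipeline.score` (the `toy_score` function and its values are purely illustrative):

```python
from itertools import product

# Same search space as the deleted script's `parameters` dict
param_space = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_df": [0.5, 0.75, 1.0],
    "nb__alpha": [0.1, 0.5, 1.0],
}

def toy_score(params):
    # Stand-in for: pipeline.set_params(**params); fit; score on the test set.
    return (params["nb__alpha"] == 0.1) + (params["tfidf__ngram_range"] == (1, 2)) + params["tfidf__max_df"] / 10

# Expand the space into a flat list of configurations, like ParameterGrid
keys = list(param_space)
grid = [dict(zip(keys, combo)) for combo in product(*param_space.values())]
assert len(grid) == 2 * 3 * 3  # 18 candidate configurations

best_params = max(grid, key=toy_score)
print(best_params)
```

Scoring on the test set inside the search loop (as the original does) leaks test data into model selection; a cross-validated search over the training split would be the safer variant.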