Commit 64fbadd
Parent(s): 43c6927
Feat: Introduce ReviewSense v2.0 with RAG Chatbot and Mistral LLM
- README.md +201 -94
- assets/{confusion_bay.png → tab1.PNG} +2 -2
- assets/{confusion_bert.png → tab2.PNG} +2 -2
- assets/wordcloud.png +0 -3
- checkpoints/sentiment-binary-best-checkpoint.ckpt +0 -3
- notebooks/reviewsense.ipynb +0 -1001
- requirements.txt +15 -10
- scripts/app.py +261 -144
- scripts/data_prepare.py +0 -263
- scripts/main.py +208 -105
- scripts/models.py +0 -256
- scripts/pipeline.py +127 -0
- scripts/train_distilbet.py +0 -101
- scripts/train_naive_bayes.py +0 -118
README.md
CHANGED

@@ -1,188 +1,295 @@

-- [🧠 How It Works: The
- [🔮 Future Improvements](#-future-improvements)
-- [⚙️ Setup and Installation](#️-setup-and-installation)
-- [▶️ Usage](#️-usage)
-- [📁 Project Structure](#-project-structure)
-- [🛠️ Technologies and Models](#️-technologies-and-models)

---

## 📖 Overview

-This project solves that problem by creating an automated system that performs a **multi-layered analysis** on any given product review, providing a structured output that is far more valuable than a simple positive/negative label.

---

-Classifies reviews as **POSITIVE** or **NEGATIVE**, powered by a fine-tuned DistilBERT model on the Amazon Reviews dataset.

-Determines specific sentiment (*Positive, Negative, Neutral*) for each aspect, e.g., *“loved the camera, disappointed with the battery life.”*

-Generates concise summaries of reviews using a pre-trained DistilBART model.

---

-We first trained a classic `Multinomial Naive Bayes` model. The text was vectorized using `TF-IDF`, and we performed a hyperparameter grid search to find the optimal settings. This approach is fast, interpretable, and provides a strong benchmark for text classification.

-Next, we fine-tuned a `DistilBERT` model, a smaller and more efficient variant of BERT. By leveraging its pre-trained understanding of the English language and fine-tuning it on our specific Amazon reviews data, we aimed to capture more of the nuance and context that the baseline model might miss.

---

## 🔮 Future Improvements

---

## ⚙️ Setup and Installation

```bash
-git clone https://github.com/Deathshot78/ReviewSense.git
cd ReviewSense
```

```bash
pip install -r requirements.txt
```

-```bash
-python data_prepare.py
-```

-## ▶️ Usage

-```bash
-python main.py
-```

---

-## 📁 Project Structure

```bash
-├── 📄 requirements.txt #
```

---

-## 🛠️ Technologies and Models

**Core Technologies**

+[](https://www.python.org/)[](https://python.langchain.com/)[](LICENSE)

+# 🛍️ ReviewSense v2.0: Product Review Analysis & Chatbot Engine

+> *ReviewSense v2.0 expands upon the initial analysis engine, adding a powerful, conversational RAG (Retrieval-Augmented Generation) chatbot. It leverages a single, efficient LLM (Mistral 7B GGUF) to provide both deep batch analysis and interactive Q&A grounded in user-provided reviews.*

+This project demonstrates an end-to-end workflow, integrating data processing, local LLM execution with `LlamaCpp`, vector storage with FAISS, conversational memory, intent classification, and an interactive Gradio web application.

+![Tab 1 screenshot](assets/tab1.PNG)
+![Tab 2 screenshot](assets/tab2.PNG)
+
+You can find the web demo here ➡ [Web Demo](https://huggingface.co/spaces/Deathshot78/ReviewSense)
+**Note:** Running this model on the CPU takes a while to complete; you can relax and get a cup of coffee while the model generates responses! ☕

---
## 📋 Table of Contents

+- [📖 Overview](#-overview)
+- [🚀 What's New in v2.0](#-whats-new-in-v20)
+- [✨ Key Features (v2.0)](#-key-features-v20)
+- [🧠 How It Works: The v2.0 Pipeline](#-how-it-works-the-v20-pipeline)
+- [🔧 Challenges & Limitations](#-challenges--limitations)
+- [💡 Prompt Engineering Journey](#-prompt-engineering-journey)
- [🔮 Future Improvements](#-future-improvements)
+- [⚙️ Setup and Installation](#️-setup-and-installation)
+- [▶️ Usage](#️-usage)
+- [📁 Project Structure (v2.0)](#-project-structure-v20)
+- [🛠️ Technologies and Models (v2.0)](#️-technologies-and-models-v20)
+- [📜 Version History](#-version-history)

---
## 📖 Overview

+Building upon the foundation of ReviewSense v1.0, which focused on extracting insights like sentiment, aspects, and summaries using multiple specialized models, **Version 2.0 introduces a significant upgrade: a conversational chatbot**.

+This chatbot allows users to ask specific questions about product reviews and receive answers synthesized directly from the provided text. To achieve this efficiently and enhance overall capabilities, v2.0 consolidates the architecture around a single, powerful yet locally runnable Large Language Model (Mistral 7B GGUF). This unified model now handles both the batch analysis tasks (with improved quality) and the interactive Q&A, demonstrating a modern approach to building multi-functional NLP applications.

---

+## 🚀 What's New in v2.0

+Version 2.0 represents a major leap in functionality and architecture:

+1. **🤖 RAG Chatbot Implementation:** Added an interactive chatbot (Phase 2) that uses Retrieval-Augmented Generation (RAG) to answer user questions based on review context.
+2. **🧠 Single LLM Architecture:** Replaced the multiple specialized models (DistilBERT, DistilBART, DeBERTa, POS Tagger) from v1.0 with a single, powerful Mistral 7B GGUF model, executed locally via `LlamaCpp`. This model now handles:
+   * Batch Analysis (Summary, Aspects, Sentiment - Phase 1) with higher quality.
+   * RAG-based Question Answering (Phase 2).
+   * Intent Classification (Guardrail for Phase 2).
+3. **📄 Dynamic Context Management:** The chatbot can now operate on a default set of reviews or dynamically update its knowledge base using user-uploaded `.txt` or `.csv` files.
+4. **💬 Conversational Memory:** Integrated LangChain's `ConversationBufferMemory`, allowing the chatbot to understand follow-up questions.
+5. **🛡️ Intent Classification Guardrail:** Implemented a robust intent classifier (using the same LLM) to prevent the chatbot from answering off-topic questions, ensuring responses stay grounded in product reviews.
+6. **🖥️ Unified Gradio UI:** Developed a two-tab Gradio interface (`app.py`) providing access to both the Batch Analyzer and the RAG Chatbot in a single application.
+7. **💻 Local Execution Script:** Added `main.py` for command-line execution of batch analysis or interactive chat without the Gradio UI.
+8. **🧱 Modular Code Structure:** Refactored code into `src/pipeline.py` for core logic, improving organization and maintainability.
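The `.txt`/`.csv` context loading described in item 3 can be sketched with the standard library alone. `load_review_context` below is a hypothetical helper for illustration only, not the project's actual code in `src/pipeline.py`:

```python
import csv
import io

def load_review_context(filename: str, data: str) -> list[str]:
    """Return a list of review strings from an uploaded .txt or .csv payload.

    .txt files are split into one review per non-empty line; for .csv files
    the first column of each row is treated as the review text.
    """
    if filename.lower().endswith(".csv"):
        rows = csv.reader(io.StringIO(data))
        return [row[0].strip() for row in rows if row and row[0].strip()]
    # Fall back to plain-text handling: one review per line.
    return [line.strip() for line in data.splitlines() if line.strip()]

txt_reviews = load_review_context("reviews.txt", "Great battery.\n\nScreen is dim.")
csv_reviews = load_review_context("reviews.csv", 'Great battery\n"Screen is dim, sadly"\n')
print(txt_reviews)  # ['Great battery.', 'Screen is dim.']
print(csv_reviews)  # ['Great battery', 'Screen is dim, sadly']
```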
+
+---

+## ✨ Key Features (v2.0)

+Includes all features from v1.0 (now powered by Mistral 7B) **plus**:

+- **Interactive RAG Chatbot:**
+  * Ask specific questions about product reviews (e.g., "How is the battery life?", "Is the app reliable?").
+  * Answers synthesized directly from provided review context using RAG.
+  * **Conversational Memory:** Understands follow-up questions ("What about the screen?").
+  * **Grounded Responses:** Designed to answer only based on the reviews provided.
+  * **Intent Guardrail:** Filters out and responds appropriately to off-topic questions.
+- **Dynamic Context Loading:**
+  * Chatbot operates on default reviews or context loaded from user-uploaded files (`.txt`/`.csv`).
+  * Clear indication of the currently active context.
+- **Unified LLM Backend:** All NLP tasks (analysis, Q&A, classification) handled by a single Mistral 7B GGUF model running locally.
+- **Dual Interface:** Accessible via Gradio web UI (`app.py`) or command line (`main.py`).
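The conversational-memory feature can be pictured as a buffer of past turns rendered into the prompt before the question is condensed. This toy `BufferMemory` only mimics the idea behind LangChain's `ConversationBufferMemory`; it is not the library class:

```python
class BufferMemory:
    """Minimal stand-in for LangChain's ConversationBufferMemory: stores
    (question, answer) turns and renders them as a chat-history string."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    def save_context(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_history(self) -> str:
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)

memory = BufferMemory()
memory.save_context("How is the battery life?", "Reviewers report 8-10 hours.")
# A follow-up like "What about the screen?" is rephrased against this history
# before retrieval, so the retriever sees a standalone question.
print(memory.as_history())
```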

---

+## 🧠 How It Works: The v2.0 Pipeline
+
+**Phase 1: Batch Analysis (via `analyze_reviews_only` or `analyze_reviews_logic`)**
+1. User provides review text (paste or file).
+2. The text is passed to the Mistral LLM using three distinct prompts (Summarization, Aspect Extraction, Sentiment Analysis).
+3. The LLM generates the three analysis outputs.
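Conceptually, Phase 1 is one model called three times with different templates. A minimal sketch of that dispatch (the template texts and the `analyze_reviews` name are illustrative, not the project's actual prompts):

```python
# Hypothetical prompt templates for the three Phase 1 tasks; the real
# prompts live in src/pipeline.py and are more elaborate.
ANALYSIS_PROMPTS = {
    "summary": "[INST] Summarize these reviews in 2-3 sentences:\n{reviews} [/INST]",
    "aspects": "[INST] List the product aspects mentioned in these reviews:\n{reviews} [/INST]",
    "sentiment": "[INST] Classify the overall sentiment of these reviews:\n{reviews} [/INST]",
}

def analyze_reviews(reviews: str, llm) -> dict[str, str]:
    """Run each analysis prompt through the same LLM callable."""
    return {task: llm(template.format(reviews=reviews))
            for task, template in ANALYSIS_PROMPTS.items()}

# A stub LLM lets us see the dispatch without loading the 7B model.
echo_llm = lambda prompt: prompt.splitlines()[0]
results = analyze_reviews("Great battery. Dim screen.", echo_llm)
print(sorted(results))  # ['aspects', 'sentiment', 'summary']
```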
+
+**Phase 2: RAG Chatbot (via `ask_question_with_guardrail` or `get_chatbot_response`)**
+1. User asks a question.
+2. **Intent Classification:** The query is first sent to the Mistral LLM with the `intent_prompt` (few-shot) to classify it as "Product" or "Off-Topic". Robust parsing checks the LLM output.
+3. **Routing:**
+   * If "Off-Topic", a canned response is returned.
+   * If "Product", proceed to RAG.
+4. **Context Retrieval:** The user's question is used to query the current FAISS vector store (containing embeddings of the active review context) to retrieve the top `k` relevant review snippets.
+5. **Conversational Chain Execution (`ConversationalRetrievalChain`):**
+   * **Condense Question:** If there's chat history, the LLM uses `CONDENSE_QUESTION_PROMPT` to rephrase the current question into a standalone query.
+   * **RAG Generation:** The condensed question and retrieved context snippets are passed to the LLM with the strict `qa_prompt`. The LLM synthesizes an answer based *only* on the provided context.
+   * **Memory Update:** The question and final answer are added to the `ConversationBufferMemory`.
+6. **Response:** The synthesized answer is returned to the user.
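Step 4's similarity search can be illustrated without FAISS or real embeddings. The toy retriever below ranks snippets by cosine similarity over word counts, standing in for dense-vector search; it is a sketch of the idea, not the project's retriever:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for all-MiniLM-L6-v2 vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, snippets: list[str], k: int = 2) -> list[str]:
    """Return the k snippets most similar to the question (FAISS does this
    over dense vectors; here it is plain cosine over word counts)."""
    q = embed(question)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]

reviews = [
    "The battery lasts all day, easily ten hours.",
    "Screen is too dim outdoors.",
    "Shipping box was damaged.",
]
print(retrieve("How is the battery life?", reviews, k=1))
# → ['The battery lasts all day, easily ten hours.']
```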

+---

+## 🔧 Challenges & Limitations

+Developing v2.0 involved significant experimentation and revealed several challenges:

+1. **Consistent Instruction Following:** While powerful, the Mistral 7B GGUF model sometimes struggled to consistently follow complex negative constraints or nuanced instructions in prompts, especially within the RAG chain. This led to:
+   * **Context Leakage:** Occasionally including irrelevant details from retrieved chunks (e.g., mentioning the webcam when asked about the battery).
+   * **Hallucination:** Making up information (e.g., mentioning "phone" for laptop battery, inventing prices or product names).
+   * **Over-Cautiousness:** Incorrectly stating "cannot find information" even when relevant details were present in the context, particularly for negative aspects (e.g., hardware issues).
+   * **Misinterpretation:** Failing to correctly understand the specific user question (e.g., "taste" vs. "type", comparison questions).
+2. **Prompt Engineering Complexity:** Finding the right prompt structure required extensive iteration. Simple prompts lacked control, while overly complex prompts sometimes confused the model. Few-shot prompting proved essential for reliable intent classification. Balancing strictness (for grounding) with flexibility (to allow synthesis) in the RAG prompt was difficult.
+3. **Intent Classification Brittleness:** Getting the LLM to output *only* the classification label required moving from zero-shot, to strict instructions, to few-shot examples, and finally adding robust parsing logic (`parse_intent`) to handle noisy LLM outputs reliably.
+4. **Performance:** Running the 7B-parameter GGUF model on a CPU is significantly slower than using smaller models or GPU acceleration. Batch analysis and RAG responses take noticeable time (though acceptable for demonstration).
+5. **Evaluation Bottleneck:** Using external APIs (like OpenAI) for RAGAs evaluation can incur costs and hit rate limits. Using the local model for evaluation is free but slower and potentially less objective.

+---

+## 💡 Prompt Engineering Journey
+
+Achieving the final, relatively stable performance required significant iteration on the prompts, particularly for the RAG chain (`qa_prompt`) and intent classification (`intent_prompt`).
+
+**Intent Classification (`intent_prompt`):**
+
+* Initial attempts with simple zero-shot prompts failed, with the model providing verbose, incorrect classifications.
+* Adding strict formatting rules (`MUST BE EXACTLY...`) helped but wasn't sufficient.
+* **Few-Shot Prompting** (providing explicit examples within the prompt) proved crucial for forcing the model to output the correct labels, although often with extra text.
+* **Robust Parsing (`parse_intent`)** was added to reliably extract the core "Product" or "Off-Topic" keyword from the model's potentially noisy output.
+
+**Final `intent_template`:**
+
+```python
+intent_template = """
+[INST]
+**CRITICAL INSTRUCTION:** Classify the user's query into ONLY ONE of two categories: "Product" or "Off-Topic".
+Your response MUST be EXACTLY "Product" or EXACTLY "Off-Topic".
+
+**EXAMPLES:**
+Query: How is the battery life?
+Classification: Product
+Query: What are the complaints about the screen?
+Classification: Product
+Query: Does it come in blue?
+Classification: Product
+Query: What is the capital of France?
+Classification: Off-Topic
+Query: Hello there
+Classification: Off-Topic
+Query: Who are you?
+Classification: Off-Topic
+
+**NOW CLASSIFY THIS QUERY:**
+Query: {query}
+[/INST]
+Classification:"""
+```
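The robust-parsing step mentioned above might look like the following sketch; this `parse_intent` is illustrative, and the project's version in `src/pipeline.py` may differ in details:

```python
def parse_intent(raw_output: str) -> str:
    """Extract 'Product' or 'Off-Topic' from noisy LLM output.

    Checks 'off-topic' first so answers that also contain the word
    'product' in their explanation are not misread.
    """
    text = raw_output.lower()
    if "off-topic" in text or "off topic" in text:
        return "Off-Topic"
    if "product" in text:
        return "Product"
    return "Off-Topic"  # Fail closed: treat unparseable output as off-topic.

print(parse_intent("Classification: Product. The user asks about a feature."))  # Product
print(parse_intent("  OFF-TOPIC\nExplanation: this is a greeting"))  # Off-Topic
```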

+**RAG Generation (`qa_system_prompt`):**

+* Initial simple prompts led to significant hallucination and context leakage.
+* Adding strict rules improved grounding but sometimes made the model overly cautious, failing to find information present in the context.
+* Explicitly addressing failure modes (like comparisons) helped for those specific cases.
+* Experimenting with different chain types (`stuff`, `map_reduce`, `refine`) showed limitations related to context window size and model instruction following. `stuff` with `ConversationalRetrievalChain` proved most practical.

+**Final `qa_system_prompt` (within `qa_prompt`):**

+```python
+# RAG System Prompt (qa_system_prompt)
+qa_system_prompt = """[INST]You are a factual assistant providing answers based **only** on the customer reviews provided.
+Your task is to answer the user's question concisely using information explicitly found in the 'CONTEXT' snippets below.
+
+**CRITICAL RULES TO FOLLOW:**
+1. **STRICTLY Contextual:** Base your answer ENTIRELY and ONLY on the information within the 'CONTEXT' section. Do NOT use any prior knowledge or external information.
+2. **Direct & Relevant:** Answer ONLY the specific question asked. Do NOT include details from the context that are irrelevant to the question, even if they appear nearby.
+3. **Synthesize Concisely:** Combine relevant facts from potentially multiple snippets into a brief answer (usually 1-3 sentences). Do NOT quote long passages unless absolutely necessary.
+4. **No Comparisons Outside Context:** If the question asks to compare the product to something *not mentioned* in the reviews, state *only*: "Cannot compare based on the provided reviews." Do not add details about the product itself in this case.
+5. **Handle Missing Info Carefully:** If, after carefully reading the context, you genuinely cannot find any information relevant to the question, state *only*: "Based on the provided reviews, I cannot find information about that." Check thoroughly before using this response.
+6. **Factual Tone:** Do NOT apologize, express opinions, make recommendations, or use conversational filler. Just state the facts found in the reviews.
+
+CONTEXT:
+---
+{context}
+---
+
+QUESTION: {question} [/INST]
+Answer:"""
+```
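The `stuff` chain behavior referenced earlier amounts to concatenating every retrieved snippet into the `{context}` slot before a single model call. A condensed sketch of that assembly (the template here is abbreviated, not the full `qa_system_prompt` above):

```python
# Abbreviated stand-in for the real qa_prompt template.
qa_template = """[INST]Answer ONLY from the context below.

CONTEXT:
---
{context}
---

QUESTION: {question} [/INST]
Answer:"""

def build_qa_prompt(snippets: list[str], question: str) -> str:
    """'stuff' chain behavior: concatenate every retrieved snippet into a
    single context block, then fill the prompt template."""
    return qa_template.format(context="\n\n".join(snippets), question=question)

prompt = build_qa_prompt(
    ["Battery easily lasts ten hours.", "Charges slowly over USB-C."],
    "How is the battery?",
)
print(prompt)
```

Because everything is stuffed into one prompt, the retrieved snippets must jointly fit in the model's context window, which is why `k` stays small.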
+
+This iterative process demonstrates the practical challenges and refinement needed when working with local LLMs in complex pipelines.

---

## 🔮 Future Improvements

+* **RAG Evaluation**: Fully implement and integrate RAGAs (or TruLens) evaluation using the local LLM or a free-tier API to get quantitative metrics on Faithfulness, Answer Relevancy, etc.
+
+* **LLM Upgrade**: Experiment with larger or more advanced instruction-tuned models (e.g., Mixtral GGUF, Llama 3 8B/70B Instruct GGUF, or API-based models like GPT-4/Claude 3) to achieve higher consistency in instruction following and synthesis.
+
+* **Advanced Retrieval**: Explore more sophisticated retrieval techniques (e.g., HyDE, MultiQueryRetriever, re-ranking) to improve the quality of context chunks passed to the LLM, potentially reducing generation errors.
+* **Batch Processing for Analysis**: Re-implement batch processing for Phase 1 using techniques like `map_reduce` to handle large numbers of reviews that exceed the LLM's context window.
+* **Error Handling & UI**: Add more granular error handling and user feedback in the Gradio UI (e.g., clearer messages if context loading fails).
+
+* **Automated Testing**: Implement unit and integration tests using `pytest` for the core logic in `src/pipeline.py`.

---

## ⚙️ Setup and Installation

+**1. Clone the Repository**

```bash
+git clone https://github.com/Deathshot78/ReviewSense.git  # Replace with your repo URL if different
cd ReviewSense
```

+**2. Install Required Packages**

```bash
pip install -r requirements.txt
```

+**3. Download LLM Model**

+The scripts will attempt to download the Mistral-7B GGUF model (`mistral-7b-instruct-v0.1.Q4_K_M.gguf`, ~4.4 GB) automatically via `wget` on the first run if it is not found in the root directory. You can also download it manually from [Hugging Face](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF) and place it in the project root.
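The download-if-missing behavior can be sketched as below. The URL follows the usual Hugging Face `resolve/main` pattern for TheBloke's repo; treat it as an assumption and verify against the model card before relying on it:

```python
import os
import subprocess

MODEL_FILE = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
# Assumed URL, built from the conventional Hugging Face resolve/main pattern:
MODEL_URL = ("https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/"
             "resolve/main/" + MODEL_FILE)

def ensure_model(path: str = MODEL_FILE) -> str:
    """Download the ~4.4 GB GGUF file with wget only if it is not present."""
    if not os.path.exists(path):
        subprocess.run(["wget", "-c", MODEL_URL, "-O", path], check=True)
    return path
```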

+---

+## ▶️ Usage

+**Web App (Gradio)**

+Run the Gradio app:

```bash
+python app.py
```

+Access the interface in your browser.

+* **Tab 1 ("Batch Analyzer"):** Paste reviews or upload a file to perform Summary, Aspect Extraction, and Sentiment Analysis. This does not affect the chatbot context.
+* **Tab 2 ("Ask a Question"):** Chat with the RAG bot. Use the file upload and "Update Chatbot Context" button within this tab to change the reviews the chatbot uses. Use "Reset Chatbot Context to Default" to revert to the built-in laptop reviews. Use "Reset Chat Memory" to clear the conversation history.

---

+## 📁 Project Structure (v2.0)

```bash
+ReviewSense/
+├── 📄 README.md          # Project documentation
+├── 📁 src/               # Source code for core logic
+│   ├── 📄 app.py         # Gradio web application
+│   ├── 📄 pipeline.py    # Core functions for analysis, RAG, etc.
+│   └── 📄 main.py        # Command-line execution
+├── 📄 requirements.txt   # Python dependencies
+├── 📄 .gitignore         # Files ignored by Git
+└── 🖼️ assets/            # Images
```

---

+## 🛠️ Technologies and Models (v2.0)

**Core Technologies**

+* Python 3.10+
+* LangChain: Orchestration, Chains (`ConversationalRetrievalChain`), Memory, Prompts
+* llama-cpp-python: Local execution of GGUF models on CPU
+* FAISS (faiss-cpu): Efficient vector similarity search
+* Sentence-Transformers (all-MiniLM-L6-v2): Text embeddings
+* Gradio: Interactive web UI
+* PyTorch (dependency via transformers/sentence-transformers)
+* Pandas, NumPy (standard data handling)

+**Core LLM**

+* Mistral 7B Instruct v0.1 (GGUF Q4_K_M): Used for all NLP tasks (Analysis, RAG Generation, Intent Classification). Downloaded from TheBloke on Hugging Face.

+---

+## 📜 Version History

+* v2.0 (Current): RAG Chatbot, single Mistral 7B model, dynamic context, memory, guardrails, Gradio UI, code refactoring.
+* v1.0: [Link to v1.0 Release/Tag on GitHub, e.g., https://github.com/Deathshot78/ReviewSense/releases/tag/v1.0] - Initial Batch Analysis Engine using multiple specialized models (DistilBERT, DistilBART, etc.). Focused on Sentiment, Aspects, and Summarization. (See the v1.0 README for full details.)
assets/{confusion_bay.png → tab1.PNG}
RENAMED

File without changes

assets/{confusion_bert.png → tab2.PNG}
RENAMED

File without changes

assets/wordcloud.png
DELETED

Git LFS Details

checkpoints/sentiment-binary-best-checkpoint.ckpt
DELETED

@@ -1,3 +0,0 @@

-version https://git-lfs.github.com/spec/v1
-oid sha256:b80c0f5882524f5859ac6a92f3311a2d8d4638bd6ef1236232fbb32057b43f3d
-size 803593979
notebooks/reviewsense.ipynb
DELETED
|
@@ -1,1001 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"cells": [
|
| 3 |
-
{
|
| 4 |
-
"cell_type": "markdown",
|
| 5 |
-
"id": "1754f3bb",
|
| 6 |
-
"metadata": {},
|
| 7 |
-
"source": [
|
| 8 |
-
"# 🛍️ ReviewSense: Product Review Analysis Engine\n",
|
| 9 |
-
"\n",
|
| 10 |
-
"> *ReviewSense is a comprehensive, end-to-end Natural Language Processing application built to extract deep, actionable insights from unstructured product reviews.* \n",
|
| 11 |
-
"Where a simple star rating only tells part of the story, ReviewSense dives into the text to uncover what customers are saying, why they're saying it, and how they feel about specific product features. "
|
| 12 |
-
]
|
| 13 |
-
},
|
| 14 |
-
{
|
| 15 |
-
"cell_type": "markdown",
|
| 16 |
-
"id": "00d383d6",
|
| 17 |
-
"metadata": {},
|
| 18 |
-
"source": [
|
| 19 |
-
"## Imports"
|
| 20 |
-
]
|
| 21 |
-
},
|
| 22 |
-
{
|
| 23 |
-
"cell_type": "code",
|
| 24 |
-
"execution_count": null,
|
| 25 |
-
"id": "4d48ba17",
|
| 26 |
-
"metadata": {},
|
| 27 |
-
"outputs": [],
|
| 28 |
-
"source": [
|
| 29 |
-
"import pytorch_lightning as pl\n",
|
| 30 |
-
"from torch.utils.data import DataLoader, Dataset\n",
|
| 31 |
-
"from transformers import AutoTokenizer\n",
|
| 32 |
-
"import pandas as pd\n",
|
| 33 |
-
"from sklearn.model_selection import train_test_split\n",
|
| 34 |
-
"import torch\n",
|
| 35 |
-
"import os\n",
|
| 36 |
-
"import numpy as np\n",
|
| 37 |
-
"from sklearn.model_selection import train_test_split, ParameterGrid, StratifiedKFold\n",
|
| 38 |
-
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
| 39 |
-
"from sklearn.naive_bayes import MultinomialNB\n",
|
| 40 |
-
"from sklearn.pipeline import Pipeline\n",
|
| 41 |
-
"from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n",
|
| 42 |
-
"import seaborn as sns\n",
|
| 43 |
-
"import matplotlib.pyplot as plt\n",
|
| 44 |
-
"from tqdm.notebook import tqdm\n",
|
| 45 |
-
"from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping\n",
|
| 46 |
-
"from pytorch_lightning.loggers import TensorBoardLogger\n",
|
| 47 |
-
"from transformers import T5ForConditionalGeneration, T5Tokenizer\n",
|
| 48 |
-
"from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup, AutoConfig\n",
|
| 49 |
-
"from torch.optim import AdamW\n",
|
| 50 |
-
"import torch\n",
|
| 51 |
-
"from torchmetrics.functional import accuracy\n",
|
| 52 |
-
"from transformers import T5ForConditionalGeneration, T5Tokenizer, AutoTokenizer, pipeline\n",
|
| 53 |
-
"\n"
|
| 54 |
-
]
|
| 55 |
-
},
|
| 56 |
-
{
|
| 57 |
-
"cell_type": "markdown",
|
| 58 |
-
"id": "8263bc02",
|
| 59 |
-
"metadata": {},
|
| 60 |
-
"source": [
|
| 61 |
-
"## Prepare the data"
|
| 62 |
-
]
|
| 63 |
-
},
|
| 64 |
-
{
|
| 65 |
-
"cell_type": "code",
|
| 66 |
-
"execution_count": null,
|
| 67 |
-
"id": "a5f8dcda",
|
| 68 |
-
"metadata": {},
|
| 69 |
-
"outputs": [],
|
| 70 |
-
"source": [
|
| 71 |
-
"def explore_and_preprocess_reviews(\n",
|
| 72 |
-
" train_path='data/train.csv', \n",
|
| 73 |
-
" test_path='data/test.csv',\n",
|
| 74 |
-
" output_dir='data'\n",
|
| 75 |
-
"):\n",
|
| 76 |
-
" \"\"\"\n",
|
| 77 |
-
" Loads the Amazon Sentiment Analysis dataset (https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews)\n",
|
| 78 |
-
" (you need to extract the train/test splits from the zip file in the data folder),\n",
|
| 79 |
-
" performs basic EDA, and preprocesses it for model training.\n",
|
| 80 |
-
"\n",
|
| 81 |
-
" Args:\n",
|
| 82 |
-
" train_path (str): Path to the training CSV file.\n",
|
| 83 |
-
" test_path (str): Path to the testing CSV file.\n",
|
| 84 |
-
" output_dir (str): Directory to save the processed file.\n",
|
| 85 |
-
" \"\"\"\n",
|
| 86 |
-
" # --- 1. Load Data ---\n",
|
| 87 |
-
" # This dataset typically comes without headers. We'll assign them.\n",
|
| 88 |
-
" # Column 1: Sentiment (1 = Negative, 2 = Positive)\n",
|
| 89 |
-
" # Column 2: Title\n",
|
| 90 |
-
" # Column 3: Review Text\n",
|
| 91 |
-
" print(f\"Loading data from '{train_path}' and '{test_path}'...\")\n",
|
| 92 |
-
" try:\n",
|
| 93 |
-
" col_names = ['sentiment_orig', 'title', 'review']\n",
|
| 94 |
-
" train_df = pd.read_csv(train_path, header=None, names=col_names)\n",
|
| 95 |
-
" test_df = pd.read_csv(test_path, header=None, names=col_names)\n",
|
| 96 |
-
" \n",
|
| 97 |
-
" # Combine for unified EDA and preprocessing\n",
|
| 98 |
-
" df = pd.concat([train_df, test_df], ignore_index=True)\n",
|
| 99 |
-
"\n",
|
| 100 |
-
" except FileNotFoundError:\n",
|
| 101 |
-
" print(f\"\\nERROR: Make sure '{train_path}' and '{test_path}' are in the specified directory.\")\n",
|
| 102 |
-
" print(\"This script is designed for the 'Amazon Reviews for Sentiment Analysis' dataset from Kaggle.\")\n",
|
| 103 |
-
" return\n",
|
| 104 |
-
"\n",
|
| 105 |
-
" df.dropna(inplace=True)\n",
|
| 106 |
-
"\n",
|
| 107 |
-
" # --- 2. Preprocessing ---\n",
|
| 108 |
-
" print(\"\\n--- Preprocessing Data for Sentiment Analysis ---\")\n",
|
| 109 |
-
"\n",
|
| 110 |
-
" # a) Create new sentiment labels (0 = Negative, 1 = Positive)\n",
"    # This dataset is binary, not three-class like the previous one.\n",
"    df['sentiment'] = df['sentiment_orig'].apply(lambda x: 0 if x == 1 else 1)\n",
"\n",
"    # b) Combine title and review body\n",
"    df['full_text'] = df['title'].astype(str) + \". \" + df['review'].astype(str)\n",
"\n",
"    # c) Select and rename columns\n",
"    processed_df = df[['full_text', 'sentiment']].copy()\n",
"\n",
"    # --- 4. Save Processed Data ---\n",
"    os.makedirs(output_dir, exist_ok=True)\n",
"    output_path = os.path.join(output_dir, 'reviews_processed.csv')\n",
"    processed_df.to_csv(output_path, index=False)\n",
"    print(f\"\\nSaved {len(processed_df)} processed reviews to '{output_path}'\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60ab838c",
"metadata": {},
"outputs": [],
"source": [
"#--- Preprocess the Reviews Dataset ---\n",
"print(\"\\n--- Preprocessing started ---\")\n",
"explore_and_preprocess_reviews()\n",
"print(\"\\n--- Preprocessing finished ---\")"
]
},
{
"cell_type": "markdown",
"id": "4c381d73",
"metadata": {},
"source": [
"## Define a base model (Multinomial Naive Bayes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3cd2b5b",
"metadata": {},
"outputs": [],
"source": [
"def train_baseline_sentiment_model(data_path='data/reviews_processed.csv', grid_search=True, nb__alpha=0.1, tfidf__max_df=0.75, tfidf__ngram_range=(1, 2), sample_size: int = 50000):\n",
"    \"\"\"\n",
"    Trains and evaluates a Multinomial Naive Bayes model for sentiment analysis.\n",
"    Can optionally perform a grid search.\n",
"\n",
"    Args:\n",
"        data_path (str): Path to the processed reviews CSV file.\n",
"        grid_search (bool): If True, performs a grid search.\n",
"        nb__alpha (float): Alpha for MultinomialNB.\n",
"        tfidf__max_df (float): max_df for TfidfVectorizer.\n",
"        tfidf__ngram_range (tuple): ngram_range for TfidfVectorizer.\n",
"        sample_size (int, optional): Number of reviews to use. If None, uses all.\n",
"    \"\"\"\n",
"    # --- 1. Load Data ---\n",
"    print(f\"Loading data from '{data_path}'...\")\n",
"    if not os.path.exists(data_path):\n",
"        print(f\"\\nERROR: '{data_path}' not found. Please run the EDA script first!\")\n",
"        return\n",
"\n",
"    df = pd.read_csv(data_path)\n",
"    df.dropna(inplace=True)\n",
"\n",
"    # --- 2. Sample Data ---\n",
"    if sample_size:\n",
"        print(f\"Using a sample of {sample_size} reviews for training the baseline model.\")\n",
"        df = df.sample(n=sample_size, random_state=42)\n",
"\n",
"    # --- 3. Train-Test Split ---\n",
"    print(\"Splitting data into training and testing sets...\")\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        df['full_text'],\n",
"        df['sentiment'],\n",
"        test_size=0.2,\n",
"        random_state=42,\n",
"        stratify=df['sentiment']\n",
"    )\n",
"\n",
"    # --- 4. Create a Pipeline ---\n",
"    pipeline = Pipeline([\n",
"        ('tfidf', TfidfVectorizer(stop_words='english')),\n",
"        ('nb', MultinomialNB()),\n",
"    ])\n",
"\n",
"    best_params = None\n",
"\n",
"    if grid_search:\n",
"        # --- 5a. Perform Grid Search ---\n",
"        print(\"Performing Grid Search to find the best hyperparameters...\")\n",
"        parameters = {\n",
"            'tfidf__ngram_range': [(1, 1), (1, 2)],\n",
"            'tfidf__max_df': [0.5, 0.75, 1.0],\n",
"            'nb__alpha': [0.1, 0.5, 1.0],\n",
"        }\n",
"        param_grid = list(ParameterGrid(parameters))\n",
"        best_score = -1\n",
"\n",
"        for params in tqdm(param_grid, desc=\"Grid Search Progress\"):\n",
"            pipeline.set_params(**params)\n",
"            pipeline.fit(X_train, y_train)\n",
"            score = pipeline.score(X_test, y_test)\n",
"            if score > best_score:\n",
"                best_score = score\n",
"                best_params = params\n",
"\n",
"        print(f\"\\nBest score on test set: {best_score:.4f}\")\n",
"        print(\"Best parameters found:\")\n",
"        print(best_params)\n",
"\n",
"    else:\n",
"        # --- 5b. Use provided hyperparameters ---\n",
"        print(\"Skipping grid search and using provided hyperparameters...\")\n",
"        best_params = {\n",
"            'nb__alpha': nb__alpha,\n",
"            'tfidf__max_df': tfidf__max_df,\n",
"            'tfidf__ngram_range': tfidf__ngram_range\n",
"        }\n",
"\n",
"    # --- 6. Train the Final Model ---\n",
"    print(\"\\nTraining final model...\")\n",
"    best_model = pipeline.set_params(**best_params)\n",
"    best_model.fit(X_train, y_train)\n",
"    print(\"Model training complete.\")\n",
"\n",
"    # --- 7. Evaluate the Best Model ---\n",
"    print(\"\\n--- Model Evaluation ---\")\n",
"    y_pred = best_model.predict(X_test)\n",
"\n",
"    accuracy = accuracy_score(y_test, y_pred)\n",
"    target_names = ['Negative', 'Positive']\n",
"\n",
"    print(f\"Accuracy: {accuracy:.4f}\")\n",
"    print(\"\\nClassification Report:\")\n",
"    print(classification_report(y_test, y_pred, target_names=target_names))\n",
"\n",
"    print(\"Confusion Matrix:\")\n",
"    cm = confusion_matrix(y_test, y_pred)\n",
"    plt.figure(figsize=(8, 6))\n",
"    sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', \n",
"                xticklabels=target_names, yticklabels=target_names)\n",
"    plt.title('Confusion Matrix for Naive Bayes on Amazon Reviews')\n",
"    plt.xlabel('Predicted Label')\n",
"    plt.ylabel('True Label')\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "093e6ae9",
"metadata": {},
"outputs": [],
"source": [
"#--- Train the base model ---\n",
"train_baseline_sentiment_model(sample_size=150000, grid_search=False)"
]
},
{
"cell_type": "markdown",
"id": "71f5e4ba",
"metadata": {},
"source": [
"## Define the dataset and lightning DataModule"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c977e0f4",
"metadata": {},
"outputs": [],
"source": [
"class ReviewDataset(Dataset):\n",
"    \"\"\"\n",
"    Custom PyTorch Dataset for Amazon Reviews.\n",
"\n",
"    This class takes a pandas DataFrame of review data, a tokenizer, and a max\n",
"    token length, and prepares it for use in a PyTorch model. It handles the\n",
"    tokenization of the text and the formatting of the labels for each item.\n",
"\n",
"    Attributes:\n",
"        tokenizer: The Hugging Face tokenizer to use for processing text.\n",
"        data (pd.DataFrame): The DataFrame containing the review data.\n",
"        max_token_len (int): The maximum sequence length for the tokenizer.\n",
"    \"\"\"\n",
"    def __init__(self, data: pd.DataFrame, tokenizer, max_token_len: int):\n",
"        \"\"\"\n",
"        Initializes the ReviewDataset.\n",
"\n",
"        Args:\n",
"            data (pd.DataFrame): The input DataFrame containing 'full_text' and\n",
"                'sentiment' columns.\n",
"            tokenizer: The pre-trained tokenizer instance.\n",
"            max_token_len (int): The maximum length for tokenized sequences.\n",
"        \"\"\"\n",
"        self.tokenizer = tokenizer\n",
"        self.data = data\n",
"        self.max_token_len = max_token_len\n",
"\n",
"    def __len__(self):\n",
"        \"\"\"\n",
"        Returns the total number of samples in the dataset.\n",
"        \"\"\"\n",
"        return len(self.data)\n",
"\n",
"    def __getitem__(self, index: int):\n",
"        \"\"\"\n",
"        Retrieves one sample from the dataset at the specified index.\n",
"\n",
"        This method handles the tokenization of a single review text, including\n",
"        padding and truncation, and formats the output into a dictionary of\n",
"        tensors ready for the model.\n",
"\n",
"        Args:\n",
"            index (int): The index of the data sample to retrieve.\n",
"\n",
"        Returns:\n",
"            dict: A dictionary containing the tokenized inputs and the label,\n",
"                with the following keys:\n",
"                - 'input_ids': The token IDs of the review text.\n",
"                - 'attention_mask': The attention mask for the review text.\n",
"                - 'labels': The sentiment label as a tensor.\n",
"        \"\"\"\n",
"        data_row = self.data.iloc[index]\n",
"        text = str(data_row.full_text)\n",
"        labels = data_row.sentiment\n",
"\n",
"        encoding = self.tokenizer.encode_plus(\n",
"            text,\n",
"            add_special_tokens=True,\n",
"            max_length=self.max_token_len,\n",
"            return_token_type_ids=False,\n",
"            padding=\"max_length\",\n",
"            truncation=True,\n",
"            return_attention_mask=True,\n",
"            return_tensors='pt',\n",
"        )\n",
"\n",
"        return dict(\n",
"            input_ids=encoding[\"input_ids\"].flatten(),\n",
"            attention_mask=encoding[\"attention_mask\"].flatten(),\n",
"            labels=torch.tensor(labels, dtype=torch.long)\n",
"        )\n",
"\n",
"class ReviewDataModule(pl.LightningDataModule):\n",
"    \"\"\"\n",
"    PyTorch Lightning DataModule to handle the Amazon Reviews dataset.\n",
"\n",
"    This class encapsulates all the steps needed to process the data:\n",
"    loading, splitting, and creating PyTorch DataLoaders for training,\n",
"    validation, and testing. It allows for using a smaller random sample of the\n",
"    full dataset for faster experimentation.\n",
"\n",
"    Attributes:\n",
"        data_path (str): Path to the processed CSV file.\n",
"        batch_size (int): The size of each data batch.\n",
"        max_token_len (int): The maximum sequence length for the tokenizer.\n",
"        tokenizer: The Hugging Face tokenizer instance.\n",
"        num_workers (int): The number of CPU cores to use for data loading.\n",
"        sample_size (int, optional): The number of samples to use. If None,\n",
"            the full dataset is used.\n",
"    \"\"\"\n",
"    def __init__(self, data_path: str, batch_size: int = 16, max_token_len: int = 256, model_name='distilbert-base-uncased', num_workers: int = 0, sample_size: int = None):\n",
"        \"\"\"\n",
"        Initializes the ReviewDataModule.\n",
"\n",
"        Args:\n",
"            data_path (str): The path to the processed CSV data file.\n",
"            batch_size (int): The number of samples per batch.\n",
"            max_token_len (int): Maximum length of tokenized sequences.\n",
"            model_name (str): The name of the pre-trained model to use for the tokenizer.\n",
"            num_workers (int): Number of subprocesses to use for data loading.\n",
"            sample_size (int, optional): If specified, a random sample of this\n",
"                size will be used from the dataset.\n",
"                Defaults to None, which uses the full dataset.\n",
"        \"\"\"\n",
"        super().__init__()\n",
"        self.data_path = data_path\n",
"        self.batch_size = batch_size\n",
"        self.max_token_len = max_token_len\n",
"        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"        self.num_workers = num_workers\n",
"        self.sample_size = sample_size\n",
"        self.train_df = None\n",
"        self.val_df = None\n",
"        self.test_df = None\n",
"\n",
"    def setup(self, stage=None):\n",
"        \"\"\"\n",
"        Loads and splits the data for training, validation, and testing.\n",
"\n",
"        This method is called by PyTorch Lightning. It reads the CSV, handles\n",
"        missing values, optionally takes a random sample, and performs a\n",
"        stratified train-validation-test split. The indices of the resulting\n",
"        DataFrames are reset to prevent potential KeyErrors during data loading.\n",
"        \"\"\"\n",
"        df = pd.read_csv(self.data_path)\n",
"        df.dropna(inplace=True)\n",
"\n",
"        # If a sample size is provided, sample the dataframe\n",
"        if self.sample_size:\n",
"            print(f\"Using a sample of {self.sample_size} reviews.\")\n",
"            df = df.sample(n=self.sample_size, random_state=42)\n",
"\n",
"        # Stratified split to maintain label distribution\n",
"        train_val_df, self.test_df = train_test_split(df, test_size=0.1, random_state=42, stratify=df.sentiment)\n",
"        self.train_df, self.val_df = train_test_split(train_val_df, test_size=0.1, random_state=42, stratify=train_val_df.sentiment)\n",
"\n",
"        # Reset indices to prevent KeyErrors\n",
"        self.train_df = self.train_df.reset_index(drop=True)\n",
"        self.val_df = self.val_df.reset_index(drop=True)\n",
"        self.test_df = self.test_df.reset_index(drop=True)\n",
"\n",
"        print(f\"Size of training set: {len(self.train_df)}\")\n",
"        print(f\"Size of validation set: {len(self.val_df)}\")\n",
"        print(f\"Size of test set: {len(self.test_df)}\")\n",
"\n",
"    def train_dataloader(self):\n",
"        \"\"\"Returns the DataLoader for the training set.\"\"\"\n",
"        return DataLoader(\n",
"            ReviewDataset(self.train_df, self.tokenizer, self.max_token_len),\n",
"            batch_size=self.batch_size,\n",
"            shuffle=True,\n",
"            num_workers=self.num_workers\n",
"        )\n",
"\n",
"    def val_dataloader(self):\n",
"        \"\"\"Returns the DataLoader for the validation set.\"\"\"\n",
"        return DataLoader(\n",
"            ReviewDataset(self.val_df, self.tokenizer, self.max_token_len),\n",
"            batch_size=self.batch_size,\n",
"            num_workers=self.num_workers\n",
"        )\n",
"\n",
"    def test_dataloader(self):\n",
"        \"\"\"Returns the DataLoader for the test set.\"\"\"\n",
"        return DataLoader(\n",
"            ReviewDataset(self.test_df, self.tokenizer, self.max_token_len),\n",
"            batch_size=self.batch_size,\n",
"            num_workers=self.num_workers\n",
"        )\n",
"    "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "985ac47b",
"metadata": {},
"outputs": [],
"source": [
"# --- Configuration ---\n",
"data_path = \"data/reviews_processed.csv\"\n",
"BATCH_SIZE = 64\n",
"MAX_TOKEN_LEN = 256\n",
"\n",
"print(\"Initializing ReviewDataModule...\")\n",
"review_datamodule = ReviewDataModule(\n",
"    data_path=data_path,\n",
"    batch_size=BATCH_SIZE,\n",
"    max_token_len=MAX_TOKEN_LEN,\n",
"    model_name='distilbert-base-uncased',\n",
"    sample_size=100000  # Pass the sample size to the datamodule\n",
")\n",
"review_datamodule.setup()\n",
"\n",
"# Fetch one batch from the training dataloader to inspect its contents\n",
"print(\"\\n--- Fetching one batch from the training dataloader ---\")\n",
"train_batch = next(iter(review_datamodule.train_dataloader()))\n",
"\n",
"print(\"\\n--- Example Batch ---\")\n",
"print(f\"Input IDs shape: {train_batch['input_ids'].shape}\")\n",
"print(f\"Attention Mask shape: {train_batch['attention_mask'].shape}\")\n",
"print(f\"Labels: {train_batch['labels']}\")\n",
"print(f\"Labels shape: {train_batch['labels'].shape}\")"
]
},
{
"cell_type": "markdown",
"id": "2c7781f4",
"metadata": {},
"source": [
"## Fine-tune DistilBERT"
|
| 497 |
-
]
|
| 498 |
-
},
|
| 499 |
-
{
|
| 500 |
-
"cell_type": "code",
|
| 501 |
-
"execution_count": null,
|
| 502 |
-
"id": "d046b940",
|
| 503 |
-
"metadata": {},
|
| 504 |
-
"outputs": [],
|
| 505 |
-
"source": [
|
| 506 |
-
"class SentimentClassifier(pl.LightningModule):\n",
|
| 507 |
-
" \"\"\"\n",
|
| 508 |
-
" PyTorch Lightning module for the sentiment classification model.\n",
|
| 509 |
-
" \"\"\"\n",
|
| 510 |
-
" def __init__(self, model_name='distilbert-base-uncased', n_classes=2, learning_rate=2e-5, n_warmup_steps=0, n_training_steps=0, dropout_prob=0.2): # Added dropout\n",
|
| 511 |
-
" super().__init__()\n",
|
| 512 |
-
" self.save_hyperparameters()\n",
|
| 513 |
-
"\n",
|
| 514 |
-
" # Configure dropout\n",
|
| 515 |
-
" config = AutoConfig.from_pretrained(model_name)\n",
|
| 516 |
-
" config.hidden_dropout_prob = dropout_prob\n",
|
| 517 |
-
" config.attention_probs_dropout_prob = dropout_prob\n",
|
| 518 |
-
" config.num_labels = n_classes\n",
|
| 519 |
-
"\n",
|
| 520 |
-
" self.model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)\n",
|
| 521 |
-
"\n",
|
| 522 |
-
" def forward(self, input_ids, attention_mask, labels=None):\n",
|
| 523 |
-
" return self.model(\n",
|
| 524 |
-
" input_ids=input_ids,\n",
|
| 525 |
-
" attention_mask=attention_mask,\n",
|
| 526 |
-
" labels=labels\n",
|
| 527 |
-
" )\n",
|
| 528 |
-
"\n",
|
| 529 |
-
" def training_step(self, batch, batch_idx):\n",
|
| 530 |
-
" output = self.forward(**batch)\n",
|
| 531 |
-
" self.log(\"train_loss\", output.loss, prog_bar=True, logger=True)\n",
|
| 532 |
-
" return output.loss\n",
|
| 533 |
-
"\n",
|
| 534 |
-
" def validation_step(self, batch, batch_idx):\n",
|
| 535 |
-
" output = self.forward(**batch)\n",
|
| 536 |
-
" preds = torch.argmax(output.logits, dim=1)\n",
|
| 537 |
-
" val_acc = accuracy(preds, batch['labels'], task='binary')\n",
|
| 538 |
-
" self.log(\"val_loss\", output.loss, prog_bar=True, logger=True)\n",
|
| 539 |
-
" self.log(\"val_accuracy\", val_acc, prog_bar=True, logger=True)\n",
|
| 540 |
-
" return output.loss\n",
|
| 541 |
-
"\n",
|
| 542 |
-
" def test_step(self, batch, batch_idx):\n",
|
| 543 |
-
" output = self.forward(**batch)\n",
|
| 544 |
-
" preds = torch.argmax(output.logits, dim=1)\n",
|
| 545 |
-
" test_acc = accuracy(preds, batch['labels'], task='binary')\n",
|
| 546 |
-
" self.log(\"test_accuracy\", test_acc)\n",
|
| 547 |
-
" return test_acc\n",
|
| 548 |
-
"\n",
|
| 549 |
-
" def predict_step(self, batch, batch_idx, dataloader_idx=0):\n",
|
| 550 |
-
" output = self.forward(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])\n",
|
| 551 |
-
" return torch.argmax(output.logits, dim=1)\n",
|
| 552 |
-
"\n",
|
| 553 |
-
" def configure_optimizers(self):\n",
|
| 554 |
-
" optimizer = AdamW(self.parameters(), lr=self.hparams.learning_rate, weight_decay=0.01)\n",
|
| 555 |
-
" scheduler = get_linear_schedule_with_warmup(\n",
|
| 556 |
-
" optimizer,\n",
|
| 557 |
-
" num_warmup_steps=self.hparams.n_warmup_steps,\n",
|
| 558 |
-
" num_training_steps=self.hparams.n_training_steps\n",
|
| 559 |
-
" )\n",
|
| 560 |
-
" return dict(optimizer=optimizer, lr_scheduler=dict(scheduler=scheduler, interval='step'))\n"
|
| 561 |
-
]
|
| 562 |
-
},
|
| 563 |
-
{
|
| 564 |
-
"cell_type": "code",
|
| 565 |
-
"execution_count": null,
|
| 566 |
-
"id": "b3a3708d",
|
| 567 |
-
"metadata": {},
|
| 568 |
-
"outputs": [],
|
| 569 |
-
"source": [
|
| 570 |
-
"def train_sentiment_model(data_path='data/reviews_processed.csv', model_name='distilbert-base-uncased', n_epochs=5, sample_size: int = None):\n",
|
| 571 |
-
" \"\"\"\n",
|
| 572 |
-
" Main function to train the sentiment analysis model on the Amazon Reviews dataset.\n",
|
| 573 |
-
"\n",
|
| 574 |
-
" Args:\n",
|
| 575 |
-
" data_path (str): Path to the processed data file.\n",
|
| 576 |
-
" model_name (str): Name of the transformer model to use.\n",
|
| 577 |
-
" n_epochs (int): Maximum number of epochs for training.\n",
|
| 578 |
-
" sample_size (int, optional): The number of reviews to use for training.\n",
|
| 579 |
-
" If None, the full dataset is used.\n",
|
| 580 |
-
" \"\"\"\n",
|
| 581 |
-
" # --- 1. Hyperparameters ---\n",
|
| 582 |
-
" BATCH_SIZE = 64\n",
|
| 583 |
-
" MAX_TOKEN_LEN = 256\n",
|
| 584 |
-
" LEARNING_RATE = 2e-5\n",
|
| 585 |
-
" N_CLASSES = 2 # Negative, Positive\n",
|
| 586 |
-
"\n",
|
| 587 |
-
" # --- 2. Initialize DataModule ---\n",
|
| 588 |
-
" print(\"Initializing ReviewDataModule...\")\n",
|
| 589 |
-
" review_datamodule = ReviewDataModule(\n",
|
| 590 |
-
" data_path=data_path,\n",
|
| 591 |
-
" batch_size=BATCH_SIZE,\n",
|
| 592 |
-
" max_token_len=MAX_TOKEN_LEN,\n",
|
| 593 |
-
" model_name=model_name,\n",
|
| 594 |
-
" sample_size=sample_size # Pass the sample size to the datamodule\n",
|
| 595 |
-
" )\n",
|
| 596 |
-
" review_datamodule.setup()\n",
|
| 597 |
-
"\n",
|
| 598 |
-
" n_training_steps = len(review_datamodule.train_dataloader()) * n_epochs\n",
|
| 599 |
-
" n_warmup_steps = int(n_training_steps * 0.1)\n",
|
| 600 |
-
"\n",
|
| 601 |
-
" # --- 3. Initialize Model ---\n",
|
| 602 |
-
" print(\"Initializing SentimentClassifier model...\")\n",
|
| 603 |
-
" model = SentimentClassifier(\n",
|
| 604 |
-
" model_name=model_name,\n",
|
| 605 |
-
" n_classes=N_CLASSES,\n",
|
| 606 |
-
" learning_rate=LEARNING_RATE,\n",
|
| 607 |
-
" n_warmup_steps=n_warmup_steps,\n",
|
| 608 |
-
" n_training_steps=n_training_steps\n",
|
| 609 |
-
" )\n",
|
| 610 |
-
"\n",
|
| 611 |
-
" # --- 4. Configure Training Callbacks ---\n",
|
| 612 |
-
" checkpoint_callback = ModelCheckpoint(\n",
|
| 613 |
-
" dirpath=\"checkpoints\",\n",
|
| 614 |
-
" filename=\"sentiment-binary-best-checkpoint\",\n",
|
| 615 |
-
" save_top_k=1,\n",
|
| 616 |
-
" verbose=True,\n",
|
| 617 |
-
" monitor=\"val_loss\",\n",
|
| 618 |
-
" mode=\"min\"\n",
|
| 619 |
-
" )\n",
|
| 620 |
-
" logger = TensorBoardLogger(\"lightning_logs\", name=\"sentiment-classifier-binary\")\n",
|
| 621 |
-
" early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)\n",
|
| 622 |
-
"\n",
|
| 623 |
-
" # --- 5. Initialize Trainer ---\n",
|
| 624 |
-
" print(\"Initializing PyTorch Lightning Trainer...\")\n",
|
| 625 |
-
" trainer = pl.Trainer(\n",
|
| 626 |
-
" logger=logger,\n",
|
| 627 |
-
" callbacks=[checkpoint_callback, early_stopping_callback],\n",
|
| 628 |
-
" max_epochs=n_epochs,\n",
|
| 629 |
-
" accelerator='gpu' if torch.cuda.is_available() else 'cpu',\n",
|
| 630 |
-
" devices=1,\n",
|
| 631 |
-
" )\n",
|
| 632 |
-
"\n",
|
| 633 |
-
" # --- 6. Start Training ---\n",
|
| 634 |
-
" print(f\"Starting training with {model_name} for up to {n_epochs} epochs...\")\n",
|
| 635 |
-
" trainer.fit(model, review_datamodule)\n",
|
| 636 |
-
"\n",
|
| 637 |
-
" # --- 7. Evaluate on Test Set and Generate Confusion Matrix ---\n",
|
| 638 |
-
" print(\"\\nTraining complete. Evaluating on the test set...\")\n",
|
| 639 |
-
" trainer.test(model, datamodule=review_datamodule)\n",
|
| 640 |
-
"\n",
|
| 641 |
-
" predictions = trainer.predict(model, datamodule=review_datamodule)\n",
|
| 642 |
-
" if predictions:\n",
|
| 643 |
-
" all_preds = torch.cat(predictions).cpu().numpy()\n",
|
| 644 |
-
" true_labels = review_datamodule.test_df.sentiment.to_numpy()\n",
|
| 645 |
-
" target_names = ['Negative', 'Positive'] # Updated labels\n",
|
| 646 |
-
"\n",
|
| 647 |
-
" cm = confusion_matrix(true_labels, all_preds)\n",
|
| 648 |
-
" plt.figure(figsize=(8, 6))\n",
|
| 649 |
-
" sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu',\n",
|
| 650 |
-
" xticklabels=target_names, yticklabels=target_names)\n",
|
| 651 |
-
" plt.title('Confusion Matrix for Sentiment Analysis')\n",
|
| 652 |
-
" plt.xlabel('Predicted Label')\n",
|
| 653 |
-
" plt.ylabel('True Label')\n",
|
| 654 |
-
" plt.show()\n",
|
| 655 |
-
"\n"
|
| 656 |
-
]
|
| 657 |
-
},
|
| 658 |
-
{
|
| 659 |
-
"cell_type": "code",
|
| 660 |
-
"execution_count": null,
|
| 661 |
-
"id": "3dae58e3",
|
| 662 |
-
"metadata": {},
|
| 663 |
-
"outputs": [],
|
| 664 |
-
"source": [
|
| 665 |
-
"#--- Train DistilBert ---\n",
|
| 666 |
-
"train_sentiment_model(data_path=data_path, sample_size=100000)"
|
| 667 |
-
]
|
| 668 |
-
},
|
| 669 |
-
{
|
| 670 |
-
"cell_type": "markdown",
|
| 671 |
-
"id": "ddbc7315",
|
| 672 |
-
"metadata": {},
|
| 673 |
-
"source": [
|
| 674 |
-
"## Define the models"
|
| 675 |
-
]
|
| 676 |
-
},
|
| 677 |
-
{
|
| 678 |
-
"cell_type": "code",
|
| 679 |
-
"execution_count": null,
|
| 680 |
-
"id": "85bd352b",
|
| 681 |
-
"metadata": {},
|
| 682 |
-
"outputs": [],
|
| 683 |
-
"source": [
|
| 684 |
-
"class ReviewSummarizer:\n",
|
| 685 |
-
" \"\"\"\n",
|
| 686 |
-
" A class to handle the summarization of product reviews using a pre-trained T5 model.\n",
|
| 687 |
-
" \"\"\"\n",
|
| 688 |
-
" def __init__(self, model_name='t5-small'):\n",
|
| 689 |
-
" \"\"\"\n",
|
| 690 |
-
" Initializes the summarizer with a pre-trained T5 model and tokenizer.\n",
|
| 691 |
-
"\n",
|
| 692 |
-
" Args:\n",
|
| 693 |
-
" model_name (str): The name of the pre-trained T5 model to use.\n",
|
| 694 |
-
" \"\"\"\n",
|
| 695 |
-
" print(f\"Loading summarization model: {model_name}...\")\n",
|
| 696 |
-
" self.model_name = model_name\n",
|
| 697 |
-
" self.device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
|
| 698 |
-
"\n",
|
| 699 |
-
" # Load the tokenizer and model from Hugging Face\n",
|
| 700 |
-
" self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)\n",
|
| 701 |
-
" self.model = T5ForConditionalGeneration.from_pretrained(self.model_name).to(self.device)\n",
|
| 702 |
-
" print(\"Summarization model loaded successfully.\")\n",
|
| 703 |
-
"\n",
|
| 704 |
-
" def summarize(self, text: str, max_length: int = 50, min_length: int = 10) -> str:\n",
|
| 705 |
-
" \"\"\"\n",
|
| 706 |
-
" Generates a summary for a given text.\n",
|
| 707 |
-
"\n",
|
| 708 |
-
" Args:\n",
|
| 709 |
-
" text (str): The review text to summarize.\n",
|
| 710 |
-
" max_length (int): The maximum length of the generated summary.\n",
|
| 711 |
-
" min_length (int): The minimum length of the generated summary.\n",
|
| 712 |
-
"\n",
|
| 713 |
-
" Returns:\n",
|
| 714 |
-
" str: The generated summary.\n",
|
| 715 |
-
" \"\"\"\n",
|
| 716 |
-
" if not text or not isinstance(text, str):\n",
|
| 717 |
-
" return \"\"\n",
|
| 718 |
-
"\n",
|
| 719 |
-
" # T5 models require a prefix for the task. For summarization, it's \"summarize: \"\n",
|
| 720 |
-
" preprocess_text = f\"summarize: {text.strip()}\"\n",
|
| 721 |
-
"\n",
|
| 722 |
-
" # Tokenize the input text\n",
|
| 723 |
-
" tokenized_text = self.tokenizer.encode(preprocess_text, return_tensors=\"pt\").to(self.device)\n",
|
| 724 |
-
"\n",
|
| 725 |
-
" # Generate the summary\n",
|
| 726 |
-
" summary_ids = self.model.generate(\n",
|
| 727 |
-
" tokenized_text,\n",
|
| 728 |
-
" max_length=max_length,\n",
|
| 729 |
-
" min_length=min_length,\n",
|
| 730 |
-
" length_penalty=2.0,\n",
|
| 731 |
-
" num_beams=4,\n",
|
| 732 |
-
" early_stopping=True\n",
|
| 733 |
-
" )\n",
|
| 734 |
-
"\n",
|
| 735 |
-
" # Decode the summary and return it\n",
|
| 736 |
-
" summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)\n",
|
| 737 |
-
" return summary\n",
|
| 738 |
-
"\n",
|
| 739 |
-
"class AspectAnalyzer:\n",
|
| 740 |
-
" \"\"\"\n",
|
| 741 |
-
" A class to handle Aspect-Based Sentiment Analysis (ABSA) using a pre-trained model.\n",
|
| 742 |
-
" \"\"\"\n",
|
| 743 |
-
" # Changed to a different, currently valid lightweight model for ABSA.\n",
|
| 744 |
-
" def __init__(self, model_name='yangheng/deberta-v3-base-absa-v1.1', force_cpu=False):\n",
|
| 745 |
-
" \"\"\"\n",
|
| 746 |
-
" Initializes the ABSA pipeline with a pre-trained model.\n",
|
| 747 |
-
"\n",
|
| 748 |
-
" Args:\n",
|
| 749 |
-
" model_name (str): The name of the pre-trained ABSA model.\n",
|
| 750 |
-
" force_cpu (bool): If True, forces the model to run on the CPU.\n",
|
| 751 |
-
" \"\"\"\n",
|
| 752 |
-
" print(f\"Loading Aspect-Based Sentiment Analysis model: {model_name}...\")\n",
|
| 753 |
-
" self.model_name = model_name\n",
|
| 754 |
-
"\n",
|
| 755 |
-
" if force_cpu:\n",
|
| 756 |
-
" self.device = -1 # Use -1 for CPU in pipeline\n",
|
| 757 |
-
" print(\"Forcing ABSA model to run on CPU.\")\n",
|
| 758 |
-
" else:\n",
|
| 759 |
-
" self.device = 0 if torch.cuda.is_available() else -1\n",
|
| 760 |
-
"\n",
|
| 761 |
-
" print(f\"Using device: {self.device} (0 for GPU, -1 for CPU)\")\n",
|
| 762 |
-
"\n",
|
| 763 |
-
" self.absa_pipeline = pipeline(\n",
|
| 764 |
-
" \"text-classification\",\n",
|
| 765 |
-
" model=self.model_name,\n",
|
| 766 |
-
" tokenizer=self.model_name,\n",
|
| 767 |
-
" device=self.device\n",
|
| 768 |
-
" )\n",
|
| 769 |
-
" print(\"ABSA model loaded successfully.\")\n",
|
| 770 |
-
"\n",
|
| 771 |
-
" def analyze(self, text: str, aspects: list) -> dict:\n",
|
| 772 |
-
" \"\"\"\n",
|
| 773 |
-
" Analyzes the sentiment towards a list of aspects within a given text.\n",
|
| 774 |
-
" \"\"\"\n",
|
| 775 |
-
" if not text or not isinstance(text, str) or not aspects:\n",
|
| 776 |
-
" return {}\n",
|
| 777 |
-
"\n",
|
| 778 |
-
" # The model expects the review and aspect separated by a special token.\n",
|
| 779 |
-
" # Note: Different ABSA models might expect different input formats.\n",
|
| 780 |
-
" # This format is common but may need adjustment for other models.\n",
|
| 781 |
-
" inputs = [f\"{text} [SEP] {aspect}\" for aspect in aspects]\n",
|
| 782 |
-
" results = self.absa_pipeline(inputs)\n",
|
| 783 |
-
"\n",
|
| 784 |
-
" # Process results into a user-friendly dictionary\n",
|
| 785 |
-
" aspect_sentiments = {}\n",
|
| 786 |
-
" for aspect, result in zip(aspects, results):\n",
|
| 787 |
-
" aspect_sentiments[aspect] = {'sentiment': result['label'], 'score': result['score']}\n",
|
| 788 |
-
"\n",
|
| 789 |
-
" return aspect_sentiments\n",
|
| 790 |
-
"\n",
-class FineTunedSentimentClassifier:
-    """
-    This class handles loading the fine-tuned checkpoint and making predictions.
-    """
-    def __init__(self, checkpoint_path, model_name='distilbert-base-uncased', force_cpu=False):
-        self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')
-        print(f"Loading fine-tuned sentiment model from checkpoint: {checkpoint_path}...")
-        print(f"Using device: {self.device}")
-
-        self.model = SentimentClassifier.load_from_checkpoint(checkpoint_path, map_location=self.device)
-        self.model.to(self.device)
-        self.model.eval()  # Set model to evaluation mode
-
-        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
-        self.labels = ['NEGATIVE', 'POSITIVE']
-        print("Fine-tuned sentiment model loaded successfully.")
-
-    def classify(self, text: str) -> dict:
-        encoding = self.tokenizer.encode_plus(
-            text, add_special_tokens=True, max_length=128,
-            return_token_type_ids=False, padding="max_length",
-            truncation=True, return_attention_mask=True, return_tensors='pt',
-        )
-        input_ids = encoding["input_ids"].to(self.device)
-        attention_mask = encoding["attention_mask"].to(self.device)
-        with torch.no_grad():
-            outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
-            logits = outputs.logits
-        probabilities = torch.softmax(logits, dim=1)
-        prediction_idx = torch.argmax(probabilities, dim=1).item()
-        return {'label': self.labels[prediction_idx], 'score': probabilities[0][prediction_idx].item()}
-
-class AspectExtractor:
-    """
-    This class uses a Part-of-Speech (POS) tagging model to first extract all
-    potential aspect terms (nouns) from a review text. It then filters these
-    nouns against a pre-defined dictionary of valid aspects for a given
-    product category to return only the relevant features.
-    """
-    def __init__(self, model_name="vblagoje/bert-english-uncased-finetuned-pos", force_cpu=False):
-        self.model_name = model_name
-        self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')
-        print(f"Loading Part-of-Speech (POS) tagging model: {self.model_name}...")
-        print(f"Using device: {self.device}")
-
-        self.pipeline = pipeline(
-            "token-classification",
-            model=self.model_name,
-            device=-1 if self.device == 'cpu' else 0,
-            aggregation_strategy="simple"
-        )
-        print("POS tagging model loaded successfully.")
-
-    def extract(self, text: str, aspect_dictionary: list) -> list:
-        """
-        Extracts aspects from the given text that are present in the provided
-        aspect dictionary.
-
-        Args:
-            text (str): The review text to analyze.
-            aspect_dictionary (list): A list of valid, known aspects for the
-                product category.
-
-        Returns:
-            list: A list of aspects that were both found in the text and are
-                present in the aspect dictionary.
-        """
-        if not text or not aspect_dictionary:
-            return []
-
-        # 1. Extract all nouns from the text using the POS model
-        model_outputs = self.pipeline(text)
-        noun_tags = {'NOUN', 'PROPN'}
-        extracted_nouns = {
-            output['word'].lower() for output in model_outputs
-            if output['entity_group'] in noun_tags
-        }
-
-        # 2. Filter the extracted nouns against the provided dictionary
-        #    We find the intersection between the two sets.
-        valid_aspects = {aspect.lower() for aspect in aspect_dictionary}
-
-        final_aspects = list(extracted_nouns.intersection(valid_aspects))
-
-        return final_aspects
-
-# --- Configuration ---
-# --- IMPORTANT: UPDATE THIS PATH ---
-# You need to provide the path to the best checkpoint file that was saved
-# during the training of your sentiment model.
-SENTIMENT_CHECKPOINT_PATH = "checkpoints/sentiment-binary-best-checkpoint.ckpt"
-
-# --- Pre-defined Aspect Dictionaries for Different Product Categories ---
-ASPECT_DICTIONARIES = {
-    "Phone": ['camera', 'battery', 'battery life', 'screen', 'performance', 'price', 'design'],
-    "Coffee Maker": ['ease of use', 'design', 'noise level', 'coffee quality', 'brew time', 'cleaning'],
-    "Book": ['plot', 'characters', 'writing style', 'pacing', 'ending'],
-    "Default": ['quality', 'price', 'service', 'design', 'features']  # A fallback list
-}
-
-def main():
-    """
-    Main function to run the command-line review analysis tool.
-    """
-    # --- 1. Load All Models ---
-    print("--- Initializing all models ---")
-    sentiment_classifier, summarizer, aspect_analyzer, aspect_extractor = None, None, None, None
-    try:
-        summarizer = ReviewSummarizer(force_cpu=True)
-        aspect_analyzer = AspectAnalyzer(force_cpu=True)
-        aspect_extractor = AspectExtractor(force_cpu=True)
-
-        if not os.path.exists(SENTIMENT_CHECKPOINT_PATH):
-            print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
-            print("!!! WARNING: Sentiment checkpoint path not found or not set. !!!")
-            print(f"!!! Please update the 'SENTIMENT_CHECKPOINT_PATH' variable in main.py")
-            print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
-        else:
-            sentiment_classifier = FineTunedSentimentClassifier(
-                checkpoint_path=SENTIMENT_CHECKPOINT_PATH, force_cpu=True
-            )
-        print("\n--- All models loaded successfully ---\n")
-    except Exception as e:
-        print(f"An error occurred during model initialization: {e}")
-        return
-
-    # --- 2. Interactive Loop ---
-    while True:
-        print("\n==================================================")
-        print("          Product Review Analysis Tool            ")
-        print("==================================================")
-
-        # Get user input
-        review_text = input("Enter the product review text (or type 'quit' to exit):\n> ")
-        if review_text.lower() == 'quit':
-            break
-
-        print("\nAvailable Product Categories:")
-        for i, category in enumerate(ASPECT_DICTIONARIES.keys(), 1):
-            print(f"{i}. {category}")
-
-        category_choice = input(f"Select a product category (1-{len(ASPECT_DICTIONARIES)}):\n> ")
-        try:
-            category_idx = int(category_choice) - 1
-            product_category = list(ASPECT_DICTIONARIES.keys())[category_idx]
-        except (ValueError, IndexError):
-            print("Invalid choice. Using 'Default' category.")
-            product_category = "Default"
-
-        # --- 3. Run Analysis ---
-        print("\n--- Analyzing Review... ---")
-
-        # a. Overall Sentiment
-        sentiment_result = sentiment_classifier.classify(review_text)
-
-        # b. Summary
-        summary_result = summarizer.summarize(review_text)
-
-        # c. Aspect Extraction and Analysis
-        aspect_dictionary = ASPECT_DICTIONARIES.get(product_category)
-        extracted_aspects = aspect_extractor.extract(review_text, aspect_dictionary)
-        aspect_results = None
-        if extracted_aspects:
-            aspect_results = aspect_analyzer.analyze(review_text, extracted_aspects)
-
-        # --- 4. Display Results ---
-        print("\n-------------------- ANALYSIS RESULTS --------------------")
-        print(f"\n[ Overall Sentiment ]")
-        print(f"  - Sentiment: {sentiment_result['label']} (Score: {sentiment_result['score']:.2f})")
-
-        print(f"\n[ Generated Summary ]")
-        print(f"  - {summary_result}")
-
-        print(f"\n[ Detected Aspect Sentiments ]")
-        if aspect_results:
-            for aspect, result in aspect_results.items():
-                print(f"  - {aspect.title()}: {result['sentiment']} (Score: {result['score']:.2f})")
-        else:
-            print("  - No relevant aspects from the dictionary were detected in the review.")
-        print("----------------------------------------------------------")
-
-# --- Run the workflow ---
-main()
-
- "metadata": {
-  "language_info": {
-   "name": "python"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
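The filtering step of the removed `AspectExtractor` boils down to a case-insensitive set intersection between the nouns a POS model finds and the category's aspect dictionary. A minimal stdlib sketch (the POS step is stubbed out here, and `filter_aspects` is a hypothetical helper, not part of the repo):

```python
# Sketch of AspectExtractor's filtering step: given candidate nouns from a
# POS tagger, keep only those that appear in the aspect dictionary,
# comparing case-insensitively.
def filter_aspects(extracted_nouns, aspect_dictionary):
    if not extracted_nouns or not aspect_dictionary:
        return []
    valid = {a.lower() for a in aspect_dictionary}
    # Intersection of lowercased nouns with the valid aspect set
    return sorted({n.lower() for n in extracted_nouns} & valid)

print(filter_aspects(["Camera", "battery", "box"],
                     ['camera', 'battery', 'battery life', 'screen']))
# ['battery', 'camera']
```

Note that multi-word aspects such as "battery life" can never match this way, since the tagger emits single-token nouns; the real extractor shares this limitation.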
requirements.txt
CHANGED

@@ -1,11 +1,16 @@
-
-
-
-
-
+langchain==0.3.27
+langchain-community==0.3.31
+gradio==5.49.1
+llama_cpp_python==0.3.16
+sentence-transformers==5.1.1
+torch==2.8.0
+transformers==4.57.1
+faiss-cpu==1.12.0
+ragas==0.3.7
+openai==1.109.1
 pandas==2.2.2
-
-
-
-
-
+datasets==4.0.0
+numpy==2.0.2
+accelerate==1.11.0
+aiohttp==3.13.1
+huggingface-hub==0.35.3
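Every dependency above is a strict `name==version` pin. Such a list can be parsed programmatically with the stdlib alone; a small illustrative sketch (the `parse_pins` helper is hypothetical, not part of the repo):

```python
# Parse strict "name==version" requirement pins into a dict,
# skipping blank lines and comments.
def parse_pins(text):
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins

print(parse_pins("langchain==0.3.27\npandas==2.2.2"))
# {'langchain': '0.3.27', 'pandas': '2.2.2'}
```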
scripts/app.py
CHANGED

@@ -1,163 +1,280 @@
 import gradio as gr
-import os
 import torch
-from
-
-import
-
-try:
-    from models import (
-        ReviewSummarizer,
-        AspectAnalyzer,
-        AspectExtractor,
-        FineTunedSentimentClassifier
-    )
-except ImportError:
-    print("CRITICAL ERROR: Make sure 'models.py' exists and contains the required classes.")
-    # Define dummy classes if imports fail, so Gradio can at least launch with an error message.
-    class ReviewSummarizer: pass
-    class AspectAnalyzer: pass
-    class AspectExtractor: pass
-    class FineTunedSentimentClassifier: pass
-
-# --- Configuration ---
-# This should be the relative path to your checkpoint file within the repository.
-SENTIMENT_CHECKPOINT_PATH = "checkpoints/sentiment-binary-best-checkpoint.ckpt"
-
-
-# --- Pre-defined Aspect Dictionaries for Different Product Categories ---
-ASPECT_DICTIONARIES = {
-    "Phone": ['camera', 'battery', 'battery life', 'screen', 'performance', 'price', 'design'],
-    "Coffee Maker": ['ease of use', 'design', 'noise level', 'coffee quality', 'brew time', 'cleaning'],
-    "Book": ['plot', 'characters', 'writing style', 'pacing', 'ending'],
-    "Default": ['quality', 'price', 'service', 'design', 'features']  # A fallback list
-}
-
-
-# --- Load All Models (Global Objects) ---
-print("--- Initializing all models for the Gradio App ---")
-sentiment_classifier, summarizer, aspect_analyzer, aspect_extractor = None, None, None, None
-try:
-    summarizer = ReviewSummarizer(force_cpu=True)
-    aspect_analyzer = AspectAnalyzer(force_cpu=True)
-    aspect_extractor = AspectExtractor(force_cpu=True)
-
-    if os.path.exists(SENTIMENT_CHECKPOINT_PATH):
-        sentiment_classifier = FineTunedSentimentClassifier(
-            checkpoint_path=SENTIMENT_CHECKPOINT_PATH, force_cpu=True
-        )
-    else:
-        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
-        print("!!! WARNING: Sentiment checkpoint path not found. !!!")
-        print(f"!!! Path checked: '{SENTIMENT_CHECKPOINT_PATH}'")
-        print("!!! The fine-tuned sentiment model will NOT be loaded. !!!")
-        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
-    print("\n--- All models loaded successfully ---\n")
-except Exception as e:
-    print(f"An error occurred during model initialization: {e}")
-
-
-# --- Define the Core Analysis Function ---
-def analyze_review(review_text, product_category):
-    if not review_text:
-        return {"ERROR": "Please enter a review."}, "", None
-
-    # --- a. Overall Sentiment Analysis ---
-    if sentiment_classifier:
-        sentiment_result = sentiment_classifier.classify(review_text)
-        sentiment_output = {
-            sentiment_result['label']: f"{sentiment_result['score']:.2f}"
-        }
-    else:
-        # **ROBUST ERROR HANDLING:** This prevents the app from crashing.
-        # It returns a dictionary that the Gradio Label component can display.
-        sentiment_output = {"Error: Model Not Loaded": 1.0}
-
-        summary_output = summarizer.summarize(review_text)
-    else:
-        summary_output = "ERROR: Summarizer model not loaded."
-
-    if aspect_extractor and aspect_analyzer:
-        aspect_dictionary = ASPECT_DICTIONARIES.get(product_category, ASPECT_DICTIONARIES["Default"])
-        extracted_aspects = aspect_extractor.extract(review_text, aspect_dictionary=aspect_dictionary)
-
-    #
-
-    aspect_output = gr.DataFrame(headers=["Aspect", "Sentiment", "Score"], label="Aspects", interactive=False)
-
 )
-
 )
-
-# --- Launch
 if __name__ == "__main__":
-
-    demo.launch()
+# app.py
+
+import gradio as gr
+import torch
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_community.llms import LlamaCpp
+from langchain.memory import ConversationBufferMemory
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
+import os
+import io
+
+# Import the logic functions from src
+import src.pipeline as pipeline
+
+# --- Global Objects & Setup ---
+# (Most setup code remains here as it's needed globally for the app)
+
+print("--- Starting App Setup ---")
+# 1. Download Model File
+model_name = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
+model_url = "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf"
+if not os.path.exists(model_name):
+    print("Downloading model...")
+    os.system(f"wget {model_url}")
+else:
+    print("Model already downloaded.")
+
+# 2. Prepare Default Sample Data & Example Batch
+print("Loading default reviews...")
+default_reviews_text = """
+This laptop is a beast! The M3 chip is incredibly fast, and the battery lasts a solid 10 hours of heavy use... (rest of laptop reviews) ...dongle life is real.
+---
+I'm a student, and the battery life is a lifesaver... Highly recommend for college.
+---
+The keyboard is a dream to type on... Bluetooth connection dropping...
+---
+Video editing on this machine is flawless... price is very expensive...
+---
+I bought this for travel... battery easily gets me through a 6-hour flight...
+---
+Don't buy this if you need a lot of ports... only two USB-C ports...
+"""
+default_reviews_list = [r.strip() for r in default_reviews_text.strip().split('---') if r.strip()]
+
+example_batch = """
+I'm absolutely blown away by the "NovaBlend Pro" blender!... (rest of blender example)... save your money.
+"""
+
+# 3. Load Embedding Model, Text Splitter
+print("Loading embedding model and text splitter...")
+model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
+embeddings = HuggingFaceEmbeddings(
+    model_name="sentence-transformers/all-MiniLM-L6-v2",
+    model_kwargs=model_kwargs
+)
+text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=40)
+
+# 4. Create Default Vector Store
+print("Creating default FAISS vector store...")
+default_vector_store = pipeline.create_vector_store_from_content(
+    "\n---\n".join(default_reviews_list), text_splitter, embeddings
+)
+if default_vector_store is None:
+    raise ValueError("Failed to create default vector store!")
+print("Default vector store created successfully.")
+
+# Global variable to hold the CURRENT vector store for the chatbot
+# NOTE: Using a global like this works for simple Gradio apps but isn't
+# robust for multiple users. Gradio state or session management is better
+# for multi-user scenarios, but this keeps it simpler for now.
+current_chatbot_vector_store = default_vector_store
+current_context_source = "Default Laptop Reviews"
+
+# 5. Load the LLM
+print("Loading LLM (Mistral-7B GGUF)...")
+llm = LlamaCpp(
+    model_path=model_name, n_gpu_layers=0, n_batch=512, n_ctx=4096,
+    f16_kv=True, temperature=0.0, max_tokens=512, verbose=False,
+    stop=["[/INST]", "User:", "Assistant:"]
+)
+
+# 6. Define All Prompts
+print("Defining all prompts...")
+# -- Phase 1 --
+summary_template = """[INST] You are a helpful assistant... Reviews:\n{reviews} [/INST]\nConcise Summary:"""
+summary_prompt = PromptTemplate(template=summary_template, input_variables=["reviews"])
+aspect_template = """[INST] You are a helpful product analyst... Reviews:\n{reviews} [/INST]\nKey Pros and Cons:"""
+aspect_prompt = PromptTemplate(template=aspect_template, input_variables=["reviews"])
+sentiment_template = """[INST] You are a helpful sentiment analyst... Reviews:\n{reviews} [/INST]\nOverall Sentiment (Score 1-10):"""
+sentiment_prompt = PromptTemplate(template=sentiment_template, input_variables=["reviews"])
+# -- Phase 2 --
+condense_question_template = """[INST] Given the following conversation... Follow Up Input: {question} [/INST]\nStandalone question:"""
+CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(condense_question_template)
+qa_system_prompt = """[INST]
+You are a factual assistant that answers only using the provided product reviews.
+If the reviews include partial or uncertain information, summarize what they say.
+If there is no information at all about the user’s question, respond with:
+"I'm sorry, there isn't enough information in the reviews to answer that."
+
+Do not use or infer information about price, comparisons to other brands, or availability unless they are directly mentioned in the reviews.
+Always include a short "Evidence:" sentence if you found relevant mentions.
+
+Context:
+{context}
+
+User question:
+{question}
+[/INST]
+"""
+qa_prompt = ChatPromptTemplate.from_messages([SystemMessagePromptTemplate.from_template(qa_system_prompt), HumanMessagePromptTemplate.from_template("Context:\n{context}\n\nQuestion:\n{question}\n\nHelpful Answer:")])
+intent_template = """
+[INST]
+**CRITICAL INSTRUCTION:** Classify the user's query into ONLY ONE of two categories: "Product" or "Off-Topic".
+Your response MUST be EXACTLY "Product" or EXACTLY "Off-Topic".
+
+**EXAMPLES:**
+Query: How is the battery life?
+Classification: Product
+Query: What are the complaints about the screen?
+Classification: Product
+Query: Does it come in blue?
+Classification: Product
+Query: What is the capital of France?
+Classification: Off-Topic
+Query: Hello there
+Classification: Off-Topic
+Query: Who are you?
+Classification: Off-Topic
+
+**NOW CLASSIFY THIS QUERY:**
+Query: {query}
+[/INST]
+Classification:"""
+intent_prompt = PromptTemplate(template=intent_template, input_variables=["query"])
+
+# 7. Global Memory Object
+chat_memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True, output_key='answer')
+
+print("--- App Setup Complete ---")
+
+
+# --- Gradio Helper Functions (Wrappers around pipeline logic) ---
+
+def analyze_reviews_gradio_wrapper(review_text, review_file):
+    """Gradio wrapper for Phase 1 analysis."""
+    content = ""
+    if review_file is not None:
+        try:
+            if hasattr(review_file, 'name'):
+                file_path = review_file.name
+                with open(file_path, 'rb') as f:
+                    byte_content = f.read()
+            else:
+                byte_content = review_file
+            try:
+                content = byte_content.decode('utf-8')
+            except UnicodeDecodeError:
+                content = byte_content.decode('latin-1')
+        except Exception as e:
+            return f"Error reading file: {e}", "", ""
+        if not content:
+            return "Error: File empty", "", ""
+    elif review_text:
+        content = review_text
+    else:
+        return "Please paste reviews or upload a file.", "", ""
+
+    # Call the core logic function
+    return pipeline.analyze_reviews_logic(
+        content, llm, summary_prompt, aspect_prompt, sentiment_prompt
+    )
+
+def update_chatbot_context_gradio_wrapper(chatbot_file_upload):
+    """Gradio wrapper to update chatbot context."""
+    global current_chatbot_vector_store, current_context_source  # Modify globals
+
+    if chatbot_file_upload is None:
+        return f"No file uploaded. Chatbot context remains: **{current_context_source}**."
+
+    print("Processing chatbot context file via Gradio...")
+    content = ""
+    file_name = "Uploaded File"
+    try:
+        if hasattr(chatbot_file_upload, 'name'):
+            file_path = chatbot_file_upload.name
+            file_name = os.path.basename(file_path)
+            with open(file_path, 'rb') as f:
+                byte_content = f.read()
+        else:
+            byte_content = chatbot_file_upload
+        try:
+            content = byte_content.decode('utf-8')
+        except UnicodeDecodeError:
+            content = byte_content.decode('latin-1')
+    except Exception as e:
+        return f"Error reading file: {e}. Context not updated."
+    if not content:
+        return "File empty. Context not updated."
+
+    # Call the core logic function to create the store
+    new_vector_store = pipeline.create_vector_store_from_content(content, text_splitter, embeddings)
+
+    if new_vector_store:
+        current_chatbot_vector_store = new_vector_store  # Update global store
+        current_context_source = f"File: {file_name}"
+        status_message = f"Chatbot context updated using **{file_name}**."
+        print(status_message)
+        return status_message
+    else:
+        # If store creation failed, keep the old one
+        status_message = f"Error creating context from {file_name}. Chatbot context remains: **{current_context_source}**."
+        print(status_message)
+        return status_message
+
+
+def chat_responder_gradio_wrapper(message, chat_history):
+    """Gradio wrapper for the chatbot response logic."""
+    # Pass necessary global objects to the core logic function
+    response = pipeline.get_chatbot_response(
+        message=message,
+        chat_memory=chat_memory,
+        vector_store=current_chatbot_vector_store,  # Use the current global store
+        llm=llm,
+        intent_prompt=intent_prompt,
+        condense_prompt=CONDENSE_QUESTION_PROMPT,
+        qa_prompt=qa_prompt
+    )
+    return response
+
+def clear_chat_memory_gradio_wrapper():
+    """Gradio wrapper to clear memory."""
+    print("Clearing chat memory via Gradio button...")
+    chat_memory.clear()
+    print("Chat memory cleared.")
+    return []  # Return empty list to clear ChatInterface display
+
+def reset_context_to_default_gradio_wrapper():
+    """Gradio wrapper to reset context to default."""
+    global current_chatbot_vector_store, current_context_source
+    print("Resetting context via Gradio button...")
+    current_chatbot_vector_store = default_vector_store
+
current_context_source = "Default Laptop Reviews"
|
| 230 |
+
status_msg = f"Chatbot context reset to **{current_context_source}**."
|
| 231 |
+
print(status_msg)
|
| 232 |
+
return status_msg
|
| 233 |
+
|
| 234 |
|
| 235 |
+
# --- Gradio UI Definition ---
|
| 236 |
+
with gr.Blocks(theme=gr.themes.Soft()) as demo:
|
| 237 |
+
gr.Markdown("# 🤖 Product Review Intelligence Center")
|
| 238 |
+
gr.Markdown("Analyze product reviews using Mistral-7B (Tab 1) or chat about reviews with customizable context (Tab 2).")
|
| 239 |
+
|
| 240 |
+
with gr.Tabs():
|
| 241 |
+
# --- TAB 1: BATCH ANALYZER ---
|
| 242 |
+
with gr.TabItem("Batch Analyzer"):
|
| 243 |
+
gr.Markdown("Paste reviews OR upload a file (.txt, .csv) to analyze them.")
|
| 244 |
+
gr.Markdown("**Note:** This analysis does *not* affect the chatbot's context in Tab 2.")
|
| 245 |
+
with gr.Row():
|
| 246 |
+
with gr.Column(scale=2):
|
| 247 |
+
review_input_text_tab1 = gr.Textbox(lines=15, placeholder="Paste reviews here...", label="Reviews Text Input")
|
| 248 |
+
review_input_file_tab1 = gr.File(label="Upload Reviews File (.txt, .csv)", file_types=[".txt", ".csv"])
|
| 249 |
+
with gr.Column(scale=1):
|
| 250 |
+
summary_output_tab1 = gr.Textbox(label="Overall Summary", lines=5, interactive=False)
|
| 251 |
+
aspect_output_tab1 = gr.Textbox(label="Key Aspects (Pros/Cons)", lines=5, interactive=False)
|
| 252 |
+
sentiment_output_tab1 = gr.Textbox(label="Sentiment Analysis", lines=5, interactive=False)
|
| 253 |
+
analyze_button_tab1 = gr.Button("Analyze Reviews")
|
| 254 |
+
gr.Examples(examples=[[example_batch, None]], inputs=[review_input_text_tab1, review_input_file_tab1], outputs=[summary_output_tab1, aspect_output_tab1, sentiment_output_tab1], fn=analyze_reviews_gradio_wrapper, cache_examples=False) # Use wrapper
|
| 255 |
+
analyze_button_tab1.click(fn=analyze_reviews_gradio_wrapper, inputs=[review_input_text_tab1, review_input_file_tab1], outputs=[summary_output_tab1, aspect_output_tab1, sentiment_output_tab1]) # Use wrapper
|
| 256 |
+
|
| 257 |
+
# --- TAB 2: CHAT ABOUT REVIEWS ---
|
| 258 |
+
with gr.TabItem("Ask a Question (Chatbot)"):
|
| 259 |
+
gr.Markdown("Ask specific questions about product reviews. Upload a file below to change the chatbot's knowledge base.")
|
| 260 |
+
chatbot_status_display = gr.Markdown(f"Chatbot is currently using: **{current_context_source}**")
|
| 261 |
+
with gr.Row():
|
| 262 |
+
chatbot_context_file = gr.File(label="Upload Chatbot Context File (.txt, .csv)", file_types=[".txt", ".csv"], scale=3)
|
| 263 |
+
update_context_button = gr.Button("Update Chatbot Context", scale=1)
|
| 264 |
+
chatbot_interface = gr.ChatInterface(
|
| 265 |
+
fn=chat_responder_gradio_wrapper, # Use wrapper
|
| 266 |
+
examples=["How is the battery life?", "What about the screen?", "What are the complaints about connectivity?", "What is the capital of France?"],
|
| 267 |
+
title="Review Chatbot"
|
| 268 |
+
)
|
| 269 |
+
with gr.Row():
|
| 270 |
+
reset_memory_button = gr.Button("🔄 Reset Chat Memory")
|
| 271 |
+
reset_context_button = gr.Button("🔄 Reset Chatbot Context to Default")
|
| 272 |
+
# Link actions to wrapper functions
|
| 273 |
+
update_context_button.click(fn=update_chatbot_context_gradio_wrapper, inputs=[chatbot_context_file], outputs=[chatbot_status_display])
|
| 274 |
+
reset_memory_button.click(fn=clear_chat_memory_gradio_wrapper, inputs=None, outputs=[chatbot_interface])
|
| 275 |
+
reset_context_button.click(fn=reset_context_to_default_gradio_wrapper, inputs=None, outputs=[chatbot_status_display])
|
| 276 |
|
| 277 |
+
# --- Launch Command ---
|
| 278 |
if __name__ == "__main__":
|
| 279 |
+
chat_memory.clear() # Clear memory each time app starts
|
| 280 |
+
demo.launch(debug=True)
|
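Both file-upload wrappers rely on the same two-step decoding fallback: try UTF-8 first, then fall back to Latin-1 (which accepts any byte sequence). The pattern can be exercised in isolation; the helper name below is illustrative, not part of the repo:

```python
def decode_review_bytes(byte_content: bytes) -> str:
    """Decode uploaded bytes as UTF-8, falling back to Latin-1 (which never fails)."""
    try:
        return byte_content.decode("utf-8")
    except UnicodeDecodeError:
        return byte_content.decode("latin-1")

# UTF-8 input decodes directly; non-UTF-8 bytes take the Latin-1 path.
print(decode_review_bytes("Great café!".encode("utf-8")))  # → Great café!
print(decode_review_bytes(b"Caf\xe9"))                     # → Café (Latin-1 fallback)
```

Latin-1 maps every byte value to a character, so the fallback can never raise; at worst, non-Latin-1 files are decoded with wrong accents rather than crashing the app.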
scripts/data_prepare.py
DELETED

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import os

def explore_and_preprocess_reviews(
    train_path='data/train.csv',
    test_path='data/test.csv',
    output_dir='data'
):
    """
    Loads the Amazon Sentiment Analysis dataset
    (https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews)
    (you need to extract the train/test splits from the zip file in the data folder),
    performs basic EDA, and preprocesses it for model training.

    Args:
        train_path (str): Path to the training CSV file.
        test_path (str): Path to the testing CSV file.
        output_dir (str): Directory to save the processed file.
    """
    # --- 1. Load Data ---
    # This dataset typically comes without headers. We'll assign them.
    # Column 1: Sentiment (1 = Negative, 2 = Positive)
    # Column 2: Title
    # Column 3: Review Text
    print(f"Loading data from '{train_path}' and '{test_path}'...")
    try:
        col_names = ['sentiment_orig', 'title', 'review']
        train_df = pd.read_csv(train_path, header=None, names=col_names)
        test_df = pd.read_csv(test_path, header=None, names=col_names)

        # Combine for unified EDA and preprocessing
        df = pd.concat([train_df, test_df], ignore_index=True)

    except FileNotFoundError:
        print(f"\nERROR: Make sure '{train_path}' and '{test_path}' are in the specified directory.")
        print("This script is designed for the 'Amazon Reviews for Sentiment Analysis' dataset from Kaggle.")
        return

    df.dropna(inplace=True)

    # --- 2. Preprocessing ---
    print("\n--- Preprocessing Data for Sentiment Analysis ---")

    # a) Create new sentiment labels (0 = Negative, 1 = Positive)
    # This dataset is binary, not three-class like the previous one.
    df['sentiment'] = df['sentiment_orig'].apply(lambda x: 0 if x == 1 else 1)

    # b) Combine title and review body
    df['full_text'] = df['title'].astype(str) + ". " + df['review'].astype(str)

    # c) Select and rename columns
    processed_df = df[['full_text', 'sentiment']].copy()

    # --- 3. Save Processed Data ---
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, 'reviews_processed.csv')
    processed_df.to_csv(output_path, index=False)
    print(f"\nSaved {len(processed_df)} processed reviews to '{output_path}'")

class ReviewDataset(Dataset):
    """
    Custom PyTorch Dataset for Amazon Reviews.

    This class takes a pandas DataFrame of review data, a tokenizer, and a max
    token length, and prepares it for use in a PyTorch model. It handles the
    tokenization of the text and the formatting of the labels for each item.

    Attributes:
        tokenizer: The Hugging Face tokenizer to use for processing text.
        data (pd.DataFrame): The DataFrame containing the review data.
        max_token_len (int): The maximum sequence length for the tokenizer.
    """
    def __init__(self, data: pd.DataFrame, tokenizer, max_token_len: int):
        """
        Initializes the ReviewDataset.

        Args:
            data (pd.DataFrame): The input DataFrame containing 'full_text' and
                'sentiment' columns.
            tokenizer: The pre-trained tokenizer instance.
            max_token_len (int): The maximum length for tokenized sequences.
        """
        self.tokenizer = tokenizer
        self.data = data
        self.max_token_len = max_token_len

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        return len(self.data)

    def __getitem__(self, index: int):
        """
        Retrieves one sample from the dataset at the specified index.

        This method handles the tokenization of a single review text, including
        padding and truncation, and formats the output into a dictionary of
        tensors ready for the model.

        Args:
            index (int): The index of the data sample to retrieve.

        Returns:
            dict: A dictionary containing the tokenized inputs and the label,
                with the following keys:
                - 'input_ids': The token IDs of the review text.
                - 'attention_mask': The attention mask for the review text.
                - 'labels': The sentiment label as a tensor.
        """
        data_row = self.data.iloc[index]
        text = str(data_row.full_text)
        labels = data_row.sentiment

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_token_len,
            return_token_type_ids=False,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return dict(
            input_ids=encoding["input_ids"].flatten(),
            attention_mask=encoding["attention_mask"].flatten(),
            labels=torch.tensor(labels, dtype=torch.long)
        )

class ReviewDataModule(pl.LightningDataModule):
    """
    PyTorch Lightning DataModule to handle the Amazon Reviews dataset.

    This class encapsulates all the steps needed to process the data:
    loading, splitting, and creating PyTorch DataLoaders for training,
    validation, and testing. It allows for using a smaller random sample of the
    full dataset for faster experimentation.

    Attributes:
        data_path (str): Path to the processed CSV file.
        batch_size (int): The size of each data batch.
        max_token_len (int): The maximum sequence length for the tokenizer.
        tokenizer: The Hugging Face tokenizer instance.
        num_workers (int): The number of CPU cores to use for data loading.
        sample_size (int, optional): The number of samples to use. If None,
            the full dataset is used.
    """
    def __init__(self, data_path: str, batch_size: int = 16, max_token_len: int = 256,
                 model_name='distilbert-base-uncased', num_workers: int = 0, sample_size: int = None):
        """
        Initializes the ReviewDataModule.

        Args:
            data_path (str): The path to the processed CSV data file.
            batch_size (int): The number of samples per batch.
            max_token_len (int): Maximum length of tokenized sequences.
            model_name (str): The name of the pre-trained model to use for the tokenizer.
            num_workers (int): Number of subprocesses to use for data loading.
            sample_size (int, optional): If specified, a random sample of this
                size will be used from the dataset. Defaults to None, which
                uses the full dataset.
        """
        super().__init__()
        self.data_path = data_path
        self.batch_size = batch_size
        self.max_token_len = max_token_len
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.num_workers = num_workers
        self.sample_size = sample_size
        self.train_df = None
        self.val_df = None
        self.test_df = None

    def setup(self, stage=None):
        """
        Loads and splits the data for training, validation, and testing.

        This method is called by PyTorch Lightning. It reads the CSV, handles
        missing values, optionally takes a random sample, and performs a
        stratified train-validation-test split. The indices of the resulting
        DataFrames are reset to prevent potential KeyErrors during data loading.
        """
        df = pd.read_csv(self.data_path)
        df.dropna(inplace=True)

        # If a sample size is provided, sample the dataframe
        if self.sample_size:
            print(f"Using a sample of {self.sample_size} reviews.")
            df = df.sample(n=self.sample_size, random_state=42)

        # Stratified split to maintain label distribution
        train_val_df, self.test_df = train_test_split(df, test_size=0.1, random_state=42, stratify=df.sentiment)
        self.train_df, self.val_df = train_test_split(train_val_df, test_size=0.1, random_state=42, stratify=train_val_df.sentiment)

        # Reset indices to prevent KeyErrors
        self.train_df = self.train_df.reset_index(drop=True)
        self.val_df = self.val_df.reset_index(drop=True)
        self.test_df = self.test_df.reset_index(drop=True)

        print(f"Size of training set: {len(self.train_df)}")
        print(f"Size of validation set: {len(self.val_df)}")
        print(f"Size of test set: {len(self.test_df)}")

    def train_dataloader(self):
        """Returns the DataLoader for the training set."""
        return DataLoader(
            ReviewDataset(self.train_df, self.tokenizer, self.max_token_len),
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers
        )

    def val_dataloader(self):
        """Returns the DataLoader for the validation set."""
        return DataLoader(
            ReviewDataset(self.val_df, self.tokenizer, self.max_token_len),
            batch_size=self.batch_size,
            num_workers=self.num_workers
        )

    def test_dataloader(self):
        """Returns the DataLoader for the test set."""
        return DataLoader(
            ReviewDataset(self.test_df, self.tokenizer, self.max_token_len),
            batch_size=self.batch_size,
            num_workers=self.num_workers
        )

if __name__ == "__main__":

    # --- Step 1: Preprocess the Reviews Dataset ---
    print("\n--- Preprocessing started ---")
    explore_and_preprocess_reviews()
    print("\n--- Preprocessing finished ---")

    # --- Configuration ---
    data_path = "data/reviews_processed.csv"
    BATCH_SIZE = 64
    MAX_TOKEN_LEN = 256

    print("Initializing ReviewDataModule...")
    review_datamodule = ReviewDataModule(
        data_path=data_path,
        batch_size=BATCH_SIZE,
        max_token_len=MAX_TOKEN_LEN,
        model_name='distilbert-base-uncased',
        sample_size=100000  # Pass the sample size to the datamodule
    )
    review_datamodule.setup()

    # Fetch one batch from the training dataloader to inspect its contents
    print("\n--- Fetching one batch from the training dataloader ---")
    train_batch = next(iter(review_datamodule.train_dataloader()))

    print("\n--- Example Batch ---")
    print(f"Input IDs shape: {train_batch['input_ids'].shape}")
    print(f"Attention Mask shape: {train_batch['attention_mask'].shape}")
    print(f"Labels: {train_batch['labels']}")
    print(f"Labels shape: {train_batch['labels'].shape}")
```
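The deleted preprocessing boiled down to two deterministic per-row transforms: remapping the Kaggle labels (1 = Negative → 0, 2 = Positive → 1) and joining title and body with ". ". A dependency-free sketch of the same logic (the helper name is illustrative, not part of the repo):

```python
def preprocess_row(sentiment_orig: int, title: str, review: str) -> tuple:
    """Mirror data_prepare.py: binary label plus combined 'title. review' text."""
    sentiment = 0 if sentiment_orig == 1 else 1   # 1 -> Negative (0), 2 -> Positive (1)
    full_text = f"{title}. {review}"
    return full_text, sentiment

print(preprocess_row(2, "Great phone", "Battery lasts two days."))
# → ('Great phone. Battery lasts two days.', 1)
```

In the script this ran vectorized over the whole DataFrame via `apply` and string concatenation; the row-level view above is just easier to sanity-check.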
scripts/main.py
CHANGED
|
@@ -1,109 +1,212 @@
|
|
| 1 |
-
|
| 2 |
-
import torch
|
| 3 |
-
import pandas as pd
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
exit()
|
| 11 |
|
| 12 |
-
#
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
print(f"\n[ Detected Aspect Sentiments ]")
|
| 100 |
-
if aspect_results:
|
| 101 |
-
for aspect, result in aspect_results.items():
|
| 102 |
-
print(f" - {aspect.title()}: {result['sentiment']} (Score: {result['score']:.2f})")
|
| 103 |
-
else:
|
| 104 |
-
print(" - No relevant aspects from the dictionary were detected in the review.")
|
| 105 |
-
print("----------------------------------------------------------")
|
| 106 |
-
|
| 107 |
-
|
| 108 |
if __name__ == "__main__":
|
| 109 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# main.py
|
|
|
|
|
|
|
| 2 |
|
| 3 |
+
import torch
|
| 4 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 5 |
+
from langchain_community.llms import LlamaCpp
|
| 6 |
+
from langchain.memory import ConversationBufferMemory
|
| 7 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 8 |
+
from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
|
| 9 |
+
import os
|
| 10 |
+
import argparse # For command-line arguments
|
| 11 |
+
|
| 12 |
+
# Import the logic functions from src
|
| 13 |
+
import src.pipeline as pipeline
|
| 14 |
+
|
| 15 |
+
# --- Global Objects & Setup ---
|
| 16 |
+
# (Similar setup as app.py, load models, prompts etc.)
|
| 17 |
+
print("--- Starting Local Execution Setup ---")
|
| 18 |
+
# 1. Check/Define Model Path
|
| 19 |
+
model_name = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
|
| 20 |
+
if not os.path.exists(model_name):
|
| 21 |
+
print(f"ERROR: Model file '{model_name}' not found. Please download it first.")
|
| 22 |
exit()
|
| 23 |
|
| 24 |
+
# 2. Prepare Default Sample Data (Optional, for context testing)
|
| 25 |
+
default_reviews_text = """...""" # Paste default laptop reviews
|
| 26 |
+
default_reviews_list = [r.strip() for r in default_reviews_text.strip().split('---') if r.strip()]
|
| 27 |
+
|
| 28 |
+
# 3. Load Embedding Model, Text Splitter
|
| 29 |
+
print("Loading embedding model and text splitter...")
|
| 30 |
+
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
|
| 31 |
+
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs=model_kwargs)
|
| 32 |
+
text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=40)
|
| 33 |
+
|
| 34 |
+
# 4. Create Default Vector Store
|
| 35 |
+
print("Creating default FAISS vector store...")
|
| 36 |
+
default_vector_store = pipeline.create_vector_store_from_content(
|
| 37 |
+
"\n---\n".join(default_reviews_list), text_splitter, embeddings
|
| 38 |
+
)
|
| 39 |
+
if default_vector_store is None: raise ValueError("Failed to create default vector store!")
|
| 40 |
+
|
| 41 |
+
# 5. Load the LLM
|
| 42 |
+
print("Loading LLM (Mistral-7B GGUF)...")
|
| 43 |
+
llm = LlamaCpp(
|
| 44 |
+
model_path=model_name, n_gpu_layers=0, n_batch=512, n_ctx=4096,
|
| 45 |
+
f16_kv=True, temperature=0.0, max_tokens=512, verbose=False,
|
| 46 |
+
stop=["[/INST]", "User:", "Assistant:"]
|
| 47 |
+
)
|
| 48 |
+
|
| 49 |
+
# 6. Define All Prompts
|
| 50 |
+
print("Defining all prompts...")
|
| 51 |
+
# -- Phase 1 --
|
| 52 |
+
summary_template = """[INST] ... Reviews:\n{reviews} [/INST]\nConcise Summary:"""
|
| 53 |
+
summary_prompt = PromptTemplate(template=summary_template, input_variables=["reviews"])
|
| 54 |
+
aspect_template = """[INST] ... Reviews:\n{reviews} [/INST]\nKey Pros and Cons:"""
|
| 55 |
+
aspect_prompt = PromptTemplate(template=aspect_template, input_variables=["reviews"])
|
| 56 |
+
sentiment_template = """[INST] ... Reviews:\n{reviews} [/INST]\nOverall Sentiment (Score 1-10):"""
|
| 57 |
+
sentiment_prompt = PromptTemplate(template=sentiment_template, input_variables=["reviews"])
|
| 58 |
+
# -- Phase 2 --
|
| 59 |
+
condense_question_template = """[INST] Given the following conversation... Follow Up Input: {question} [/INST]\nStandalone question:"""
|
| 60 |
+
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(condense_question_template)
|
| 61 |
+
qa_system_prompt = """[INST]
|
| 62 |
+
You are a factual assistant that answers only using the provided product reviews.
|
| 63 |
+
If the reviews include partial or uncertain information, summarize what they say.
|
| 64 |
+
If there is no information at all about the user’s question, respond with:
|
| 65 |
+
"I'm sorry, there isn't enough information in the reviews to answer that."
|
| 66 |
+
|
| 67 |
+
Do not use or infer information about price, comparisons to other brands, or availability unless they are directly mentioned in the reviews.
|
| 68 |
+
Always include a short "Evidence:" sentence if you found relevant mentions.
|
| 69 |
+
|
| 70 |
+
Context:
|
| 71 |
+
{context}
|
| 72 |
+
|
| 73 |
+
User question:
|
| 74 |
+
{question}
|
| 75 |
+
[/INST]
|
| 76 |
+
"""
|
| 77 |
+
qa_prompt = ChatPromptTemplate.from_messages([SystemMessagePromptTemplate.from_template(qa_system_prompt), HumanMessagePromptTemplate.from_template("Context:\n{context}\n\nQuestion:\n{question}\n\nHelpful Answer:")])
|
| 78 |
+
intent_template = """
|
| 79 |
+
[INST]
|
| 80 |
+
**CRITICAL INSTRUCTION:** Classify the user's query into ONLY ONE of two categories: "Product" or "Off-Topic".
|
| 81 |
+
Your response MUST be EXACTLY "Product" or EXACTLY "Off-Topic".
|
| 82 |
+
|
| 83 |
+
**EXAMPLES:**
|
| 84 |
+
Query: How is the battery life?
|
| 85 |
+
Classification: Product
|
| 86 |
+
Query: What are the complaints about the screen?
|
| 87 |
+
Classification: Product
|
| 88 |
+
Query: Does it come in blue?
|
| 89 |
+
Classification: Product
|
| 90 |
+
Query: What is the capital of France?
|
| 91 |
+
Classification: Off-Topic
|
| 92 |
+
Query: Hello there
|
| 93 |
+
Classification: Off-Topic
|
| 94 |
+
Query: Who are you?
|
| 95 |
+
Classification: Off-Topic
|
| 96 |
+
|
| 97 |
+
**NOW CLASSIFY THIS QUERY:**
|
| 98 |
+
Query: {query}
|
| 99 |
+
[/INST]
|
| 100 |
+
Classification:"""
|
| 101 |
+
intent_prompt = PromptTemplate(template=intent_template, input_variables=["query"])
|
| 102 |
+
|
| 103 |
+
# 7. Memory Object (Needed for chatbot logic)
|
| 104 |
+
chat_memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True, output_key='answer')
|
| 105 |
+
|
| 106 |
+
print("--- Local Setup Complete ---")
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
# --- Main Execution Logic ---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 110 |
if __name__ == "__main__":
|
| 111 |
+
parser = argparse.ArgumentParser(description="Run ReviewSense Analysis or Chat locally.")
|
| 112 |
+
parser.add_argument("--mode", choices=['analyze', 'chat'], required=True, help="Mode to run: 'analyze' reviews from a file, or 'chat' interactively.")
|
| 113 |
+
parser.add_argument("--input", type=str, help="Path to input .txt file for 'analyze' mode, or initial query for 'chat' mode.")
|
| 114 |
+
parser.add_argument("--context", type=str, help="Optional: Path to a .txt file to use as context for 'chat' mode (defaults to built-in laptop reviews).")
|
| 115 |
+
|
| 116 |
+
args = parser.parse_args()
|
| 117 |
+
|
| 118 |
+
# --- ANALYZE MODE ---
|
| 119 |
+
if args.mode == 'analyze':
|
| 120 |
+
if not args.input or not os.path.exists(args.input):
|
| 121 |
+
print(f"Error: Input file '{args.input}' not found for analyze mode.")
|
| 122 |
+
exit()
|
| 123 |
+
print(f"\n--- Running Analysis on: {args.input} ---")
|
| 124 |
+
try:
|
| 125 |
+
with open(args.input, 'r', encoding='utf-8') as f:
|
| 126 |
+
review_content = f.read()
|
| 127 |
+
except Exception as e:
|
| 128 |
+
print(f"Error reading input file: {e}")
|
| 129 |
+
exit()
|
| 130 |
+
|
| 131 |
+
summary, aspects, sentiment = pipeline.analyze_reviews_logic(
|
| 132 |
+
review_content, llm, summary_prompt, aspect_prompt, sentiment_prompt
|
| 133 |
+
)
|
| 134 |
+
print("\n--- Analysis Results ---")
|
| 135 |
+
print("\n[Summary]")
|
| 136 |
+
print(summary)
|
| 137 |
+
print("\n[Aspects]")
|
| 138 |
+
print(aspects)
|
| 139 |
+
print("\n[Sentiment]")
|
| 140 |
+
print(sentiment)
|
| 141 |
+
|
| 142 |
+
# --- CHAT MODE ---
|
| 143 |
+
elif args.mode == 'chat':
|
| 144 |
+
print("\n--- Starting Interactive Chat ---")
|
| 145 |
+
# Determine context
|
| 146 |
+
chat_vector_store = default_vector_store
|
| 147 |
+
context_name = "Default Laptop Reviews"
|
| 148 |
+
if args.context:
|
| 149 |
+
if not os.path.exists(args.context):
|
| 150 |
+
print(f"Warning: Context file '{args.context}' not found. Using default context.")
|
| 151 |
+
else:
|
| 152 |
+
print(f"Loading context from: {args.context}")
|
| 153 |
+
try:
|
| 154 |
+
with open(args.context, 'r', encoding='utf-8') as f:
|
| 155 |
+
context_content = f.read()
|
| 156 |
+
chat_vector_store = pipeline.create_vector_store_from_content(
|
| 157 |
+
+                context_content, text_splitter, embeddings
+            )
+            if chat_vector_store:
+                context_name = f"File: {os.path.basename(args.context)}"
+            else:
+                print("Failed to load context file. Using default context.")
+                chat_vector_store = default_vector_store
+        except Exception as e:
+            print(f"Error reading context file '{args.context}': {e}. Using default context.")
+            chat_vector_store = default_vector_store
+
+    print(f"Using context: {context_name}")
+    chat_memory.clear()  # Start fresh chat session
+
+    # Handle initial query if provided
+    if args.input:
+        print("\nUser:", args.input)
+        response = pipeline.get_chatbot_response(
+            message=args.input,
+            chat_memory=chat_memory,
+            vector_store=chat_vector_store,
+            llm=llm,
+            intent_prompt=intent_prompt,
+            condense_prompt=CONDENSE_QUESTION_PROMPT,
+            qa_prompt=qa_prompt
+        )
+        print("\nAssistant:", response)
+
+    # Interactive loop
+    print("\nEnter your questions (type 'quit' or 'exit' to stop):")
+    while True:
+        try:
+            user_message = input("\nUser: ")
+            if user_message.lower() in ['quit', 'exit']:
+                break
+            if not user_message:
+                continue
+
+            response = pipeline.get_chatbot_response(
+                message=user_message,
+                chat_memory=chat_memory,
+                vector_store=chat_vector_store,
+                llm=llm,
+                intent_prompt=intent_prompt,
+                condense_prompt=CONDENSE_QUESTION_PROMPT,
+                qa_prompt=qa_prompt
+            )
+            print("\nAssistant:", response)
+
+        except EOFError:  # Handle Ctrl+D
+            break
+        except KeyboardInterrupt:  # Handle Ctrl+C
+            break
+    print("\n--- Chat session ended. ---")
+
+    print("\n--- Local Execution Finished ---")
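The interactive loop's quit/skip behaviour can be exercised without a live LLM by replaying a scripted message list; a minimal sketch, where `run_chat` and the `echo` stub are illustrative helpers (not part of the repository):

```python
def run_chat(messages, respond):
    """Replays the CLI loop's exit and empty-input rules over scripted messages."""
    transcript = []
    for user_message in messages:
        if user_message.lower() in ['quit', 'exit']:
            break      # same exit condition as the interactive loop
        if not user_message:
            continue   # empty input is skipped, as in the loop
        transcript.append((user_message, respond(user_message)))
    return transcript

def echo(message):
    # Stand-in for pipeline.get_chatbot_response
    return f"echo: {message}"

print(run_chat(["hi", "", "quit", "ignored"], echo))
# → [('hi', 'echo: hi')]
```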
scripts/models.py
DELETED
@@ -1,256 +0,0 @@
-import pytorch_lightning as pl
-from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup, AutoConfig
-from torch.optim import AdamW
-import torch
-from torchmetrics.functional import accuracy
-from transformers import T5ForConditionalGeneration, T5Tokenizer, AutoTokenizer, pipeline
-
-class SentimentClassifier(pl.LightningModule):
-    """
-    PyTorch Lightning module for the sentiment classification model.
-    """
-    def __init__(self, model_name='distilbert-base-uncased', n_classes=2, learning_rate=2e-5, n_warmup_steps=0, n_training_steps=0, dropout_prob=0.2):  # Added dropout
-        super().__init__()
-        self.save_hyperparameters()
-
-        # Configure dropout
-        config = AutoConfig.from_pretrained(model_name)
-        config.hidden_dropout_prob = dropout_prob
-        config.attention_probs_dropout_prob = dropout_prob
-        config.num_labels = n_classes
-
-        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
-
-    def forward(self, input_ids, attention_mask, labels=None):
-        return self.model(
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            labels=labels
-        )
-
-    def training_step(self, batch, batch_idx):
-        output = self.forward(**batch)
-        self.log("train_loss", output.loss, prog_bar=True, logger=True)
-        return output.loss
-
-    def validation_step(self, batch, batch_idx):
-        output = self.forward(**batch)
-        preds = torch.argmax(output.logits, dim=1)
-        val_acc = accuracy(preds, batch['labels'], task='binary')
-        self.log("val_loss", output.loss, prog_bar=True, logger=True)
-        self.log("val_accuracy", val_acc, prog_bar=True, logger=True)
-        return output.loss
-
-    def test_step(self, batch, batch_idx):
-        output = self.forward(**batch)
-        preds = torch.argmax(output.logits, dim=1)
-        test_acc = accuracy(preds, batch['labels'], task='binary')
-        self.log("test_accuracy", test_acc)
-        return test_acc
-
-    def predict_step(self, batch, batch_idx, dataloader_idx=0):
-        output = self.forward(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
-        return torch.argmax(output.logits, dim=1)
-
-    def configure_optimizers(self):
-        optimizer = AdamW(self.parameters(), lr=self.hparams.learning_rate, weight_decay=0.01)
-        scheduler = get_linear_schedule_with_warmup(
-            optimizer,
-            num_warmup_steps=self.hparams.n_warmup_steps,
-            num_training_steps=self.hparams.n_training_steps
-        )
-        return dict(optimizer=optimizer, lr_scheduler=dict(scheduler=scheduler, interval='step'))
-
-class ReviewSummarizer:
-    """
-    A class to handle the summarization of product reviews using a pre-trained T5 model.
-    """
-    def __init__(self, model_name='t5-small'):
-        """
-        Initializes the summarizer with a pre-trained T5 model and tokenizer.
-
-        Args:
-            model_name (str): The name of the pre-trained T5 model to use.
-        """
-        print(f"Loading summarization model: {model_name}...")
-        self.model_name = model_name
-        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
-
-        # Load the tokenizer and model from Hugging Face
-        self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)
-        self.model = T5ForConditionalGeneration.from_pretrained(self.model_name).to(self.device)
-        print("Summarization model loaded successfully.")
-
-    def summarize(self, text: str, max_length: int = 50, min_length: int = 10) -> str:
-        """
-        Generates a summary for a given text.
-
-        Args:
-            text (str): The review text to summarize.
-            max_length (int): The maximum length of the generated summary.
-            min_length (int): The minimum length of the generated summary.
-
-        Returns:
-            str: The generated summary.
-        """
-        if not text or not isinstance(text, str):
-            return ""
-
-        # T5 models require a prefix for the task. For summarization, it's "summarize: "
-        preprocess_text = f"summarize: {text.strip()}"
-
-        # Tokenize the input text
-        tokenized_text = self.tokenizer.encode(preprocess_text, return_tensors="pt").to(self.device)
-
-        # Generate the summary
-        summary_ids = self.model.generate(
-            tokenized_text,
-            max_length=max_length,
-            min_length=min_length,
-            length_penalty=2.0,
-            num_beams=4,
-            early_stopping=True
-        )
-
-        # Decode the summary and return it
-        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
-        return summary
-
-class AspectAnalyzer:
-    """
-    A class to handle Aspect-Based Sentiment Analysis (ABSA) using a pre-trained model.
-    """
-    # Changed to a different, currently valid lightweight model for ABSA.
-    def __init__(self, model_name='yangheng/deberta-v3-base-absa-v1.1', force_cpu=False):
-        """
-        Initializes the ABSA pipeline with a pre-trained model.
-
-        Args:
-            model_name (str): The name of the pre-trained ABSA model.
-            force_cpu (bool): If True, forces the model to run on the CPU.
-        """
-        print(f"Loading Aspect-Based Sentiment Analysis model: {model_name}...")
-        self.model_name = model_name
-
-        if force_cpu:
-            self.device = -1  # Use -1 for CPU in pipeline
-            print("Forcing ABSA model to run on CPU.")
-        else:
-            self.device = 0 if torch.cuda.is_available() else -1
-
-        print(f"Using device: {self.device} (0 for GPU, -1 for CPU)")
-
-        self.absa_pipeline = pipeline(
-            "text-classification",
-            model=self.model_name,
-            tokenizer=self.model_name,
-            device=self.device
-        )
-        print("ABSA model loaded successfully.")
-
-    def analyze(self, text: str, aspects: list) -> dict:
-        """
-        Analyzes the sentiment towards a list of aspects within a given text.
-        """
-        if not text or not isinstance(text, str) or not aspects:
-            return {}
-
-        # The model expects the review and aspect separated by a special token.
-        # Note: Different ABSA models might expect different input formats.
-        # This format is common but may need adjustment for other models.
-        inputs = [f"{text} [SEP] {aspect}" for aspect in aspects]
-        results = self.absa_pipeline(inputs)
-
-        # Process results into a user-friendly dictionary
-        aspect_sentiments = {}
-        for aspect, result in zip(aspects, results):
-            aspect_sentiments[aspect] = {'sentiment': result['label'], 'score': result['score']}
-
-        return aspect_sentiments
-
-class FineTunedSentimentClassifier:
-    """
-    This class handles loading the fine-tuned checkpoint and making predictions.
-    """
-    def __init__(self, checkpoint_path, model_name='distilbert-base-uncased', force_cpu=False):
-        self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')
-        print(f"Loading fine-tuned sentiment model from checkpoint: {checkpoint_path}...")
-        print(f"Using device: {self.device}")
-
-        self.model = SentimentClassifier.load_from_checkpoint(checkpoint_path, map_location=self.device)
-        self.model.to(self.device)
-        self.model.eval()  # Set model to evaluation mode
-
-        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
-        self.labels = ['NEGATIVE', 'POSITIVE']
-        print("Fine-tuned sentiment model loaded successfully.")
-
-    def classify(self, text: str) -> dict:
-        encoding = self.tokenizer.encode_plus(
-            text, add_special_tokens=True, max_length=128,
-            return_token_type_ids=False, padding="max_length",
-            truncation=True, return_attention_mask=True, return_tensors='pt',
-        )
-        input_ids = encoding["input_ids"].to(self.device)
-        attention_mask = encoding["attention_mask"].to(self.device)
-        with torch.no_grad():
-            outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
-        logits = outputs.logits
-        probabilities = torch.softmax(logits, dim=1)
-        prediction_idx = torch.argmax(probabilities, dim=1).item()
-        return {'label': self.labels[prediction_idx], 'score': probabilities[0][prediction_idx].item()}
-
-class AspectExtractor:
-    """
-    This class uses a Part-of-Speech (POS) tagging model to first extract all
-    potential aspect terms (nouns) from a review text. It then filters these
-    nouns against a pre-defined dictionary of valid aspects for a given
-    product category to return only the relevant features.
-    """
-    def __init__(self, model_name="vblagoje/bert-english-uncased-finetuned-pos", force_cpu=False):
-        self.model_name = model_name
-        self.device = 'cpu' if force_cpu else ('cuda' if torch.cuda.is_available() else 'cpu')
-        print(f"Loading Part-of-Speech (POS) tagging model: {self.model_name}...")
-        print(f"Using device: {self.device}")
-
-        self.pipeline = pipeline(
-            "token-classification",
-            model=self.model_name,
-            device=-1 if self.device == 'cpu' else 0,
-            aggregation_strategy="simple"
-        )
-        print("POS tagging model loaded successfully.")
-
-    def extract(self, text: str, aspect_dictionary: list) -> list:
-        """
-        Extracts aspects from the given text that are present in the provided
-        aspect dictionary.
-
-        Args:
-            text (str): The review text to analyze.
-            aspect_dictionary (list): A list of valid, known aspects for the
-                product category.
-
-        Returns:
-            list: A list of aspects that were both found in the text and are
-                present in the aspect dictionary.
-        """
-        if not text or not aspect_dictionary:
-            return []
-
-        # 1. Extract all nouns from the text using the POS model
-        model_outputs = self.pipeline(text)
-        noun_tags = {'NOUN', 'PROPN'}
-        extracted_nouns = {
-            output['word'].lower() for output in model_outputs
-            if output['entity_group'] in noun_tags
-        }
-
-        # 2. Filter the extracted nouns against the provided dictionary
-        # We find the intersection between the two sets.
-        valid_aspects = {aspect.lower() for aspect in aspect_dictionary}
-
-        final_aspects = list(extracted_nouns.intersection(valid_aspects))
-
-        return final_aspects
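The final step of `FineTunedSentimentClassifier.classify` (softmax over two logits, then argmax for the label) can be sketched in plain Python without torch; the logit values below are made up for illustration:

```python
import math

def softmax_label(logits, labels=('NEGATIVE', 'POSITIVE')):
    """Mirrors the torch.softmax + argmax step of classify(), in pure Python."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))        # argmax over class probabilities
    return {'label': labels[idx], 'score': probs[idx]}

result = softmax_label([-1.0, 2.0])      # hypothetical logits from the model
print(result['label'])                   # POSITIVE
print(round(result['score'], 3))
```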
scripts/pipeline.py
ADDED
@@ -0,0 +1,127 @@
+# src/pipeline.py
+
+import os
+import io
+from langchain_community.vectorstores import FAISS
+from langchain.chains import ConversationalRetrievalChain
+from langchain.prompts import PromptTemplate  # Ensure necessary Langchain imports are here if needed directly
+
+# --- Core Logic Functions ---
+
+def analyze_reviews_logic(review_text: str, llm, summary_prompt, aspect_prompt, sentiment_prompt):
+    """
+    Performs Phase 1 analysis (summary, aspects, sentiment) on the provided text.
+    """
+    print(f"Running batch analysis logic on {len(review_text)} chars...")
+    try:
+        summary_result = llm.invoke(summary_prompt.format(reviews=review_text)).strip()
+        print(" -> Summary generated.")
+        aspect_result = llm.invoke(aspect_prompt.format(reviews=review_text)).strip()
+        print(" -> Aspects extracted.")
+        sentiment_result = llm.invoke(sentiment_prompt.format(reviews=review_text)).strip()
+        print(" -> Sentiment analyzed.")
+        return summary_result, aspect_result, sentiment_result
+    except Exception as e:
+        print(f"ERROR during batch analysis logic: {e}")
+        error_msg = f"Error during analysis: {e}"
+        return error_msg, error_msg, error_msg
+
+def create_vector_store_from_content(content: str, text_splitter, embeddings):
+    """
+    Splits content and creates a new FAISS vector store.
+    Returns the vector store or None if an error occurs.
+    """
+    print("Creating new vector store from content...")
+    if not content:
+        print("Error: No content provided to create vector store.")
+        return None
+
+    # Split content
+    if "\n---\n" in content:
+        reviews_list = [r.strip() for r in content.strip().split('\n---\n') if r.strip()]
+    else:
+        reviews_list = [r.strip() for r in content.strip().split('\n\n') if r.strip()]
+        if len(reviews_list) <= 1: reviews_list = [content.strip()]  # Single block case
+
+    if not reviews_list:
+        print("Error: Could not extract reviews from content.")
+        return None
+
+    review_chunks = text_splitter.create_documents(reviews_list)
+    if not review_chunks:
+        print("Error: Failed to create document chunks.")
+        return None
+
+    try:
+        vector_store = FAISS.from_documents(review_chunks, embeddings)
+        print("Vector store created successfully.")
+        return vector_store
+    except Exception as e:
+        print(f"Error creating FAISS index: {e}")
+        return None
+
+def parse_intent(llm_output: str) -> str:
+    """
+    Parses the LLM output to find 'Product' or 'Off-Topic'.
+    Defaults to 'Off-Topic' if neither is found or output is unexpected.
+    Uses case-insensitive 'in' check for robustness.
+    """
+    output_lower = llm_output.strip().lower()
+    if "product" in output_lower:
+        return "Product"
+    elif "off-topic" in output_lower:
+        return "Off-Topic"
+    else:
+        print(f" -> Unexpected classification: '{llm_output.strip()}'. Defaulting to Off-Topic.")
+        return "Off-Topic"
+
+def get_chatbot_response(message: str, chat_memory, vector_store, llm, intent_prompt, condense_prompt, qa_prompt):
+    """
+    Handles Phase 2: Classifies intent and runs RAG if appropriate.
+    Returns the chatbot's response string.
+    """
+    print(f"\nProcessing chatbot query: {message}")
+
+    # --- 1. Classify Intent ---
+    formatted_intent_prompt = intent_prompt.format(query=message)
+    intent_result_raw = llm.invoke(formatted_intent_prompt)
+    print(f"  DEBUG: Raw Intent Output: '{intent_result_raw.strip()}'")
+    intent = parse_intent(intent_result_raw)
+    print(f" -> Detected Intent: {intent}")
+
+    # --- 2. Route ---
+    if intent == "Product":
+        print(" -> Routing to RAG chain...")
+        if vector_store is None:
+            print("  ERROR: No vector store available for RAG.")
+            return "Sorry, I don't have any review context loaded to answer product questions."
+
+        retriever = vector_store.as_retriever(search_kwargs={"k": 4})
+
+        # Create chain dynamically for each call
+        conv_qa_chain = ConversationalRetrievalChain.from_llm(
+            llm=llm,
+            retriever=retriever,
+            memory=chat_memory,
+            condense_question_prompt=condense_prompt,
+            combine_docs_chain_kwargs={"prompt": qa_prompt},
+            return_source_documents=True,  # Required for context list in result
+            verbose=False
+        )
+        try:
+            # Pass only question - memory handles history internally
+            result = conv_qa_chain.invoke({"question": message})
+            answer = result['answer'].strip()
+            print(f" -> RAG Answer: {answer}")
+            return answer
+        except Exception as e:
+            print(f"ERROR during RAG chain execution: {e}")
+            # Optionally log traceback: import traceback; traceback.print_exc()
+            return "Sorry, I encountered an error trying to find an answer in the reviews."
+
+    else:  # Off-Topic
+        print(" -> Routing to canned response...")
+        answer = "I'm sorry, I can only answer questions about the product reviews for this item."
+        # Optional: Save off-topic turn to memory if desired
+        # chat_memory.save_context({"question": message}, {"answer": answer})
+        return answer
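`parse_intent` is a pure function, so its routing rule can be checked in isolation. This standalone copy reproduces its logic for illustration (the noisy inputs are invented):

```python
def parse_intent(llm_output: str) -> str:
    """Standalone copy of pipeline.parse_intent's case-insensitive routing rule."""
    output_lower = llm_output.strip().lower()
    if "product" in output_lower:
        return "Product"
    elif "off-topic" in output_lower:
        return "Off-Topic"
    return "Off-Topic"  # any unexpected classifier output falls back to Off-Topic

print(parse_intent("  Product  "))                 # Product
print(parse_intent("This is Off-Topic."))          # Off-Topic
print(parse_intent("???"))                         # Off-Topic
print(parse_intent("A product-related question"))  # Product
```

Because the substring check runs on the whole LLM reply, a verbose answer such as "The query is Product-related" still routes correctly, which is the robustness the docstring refers to.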
scripts/train_distilbet.py
DELETED
@@ -1,101 +0,0 @@
-import pytorch_lightning as pl
-from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
-from pytorch_lightning.loggers import TensorBoardLogger
-import torch
-import seaborn as sns
-import matplotlib.pyplot as plt
-from sklearn.metrics import confusion_matrix
-from data_prepare import ReviewDataModule, ReviewDataset
-from models import SentimentClassifier
-
-def train_sentiment_model(data_path='data/reviews_processed.csv', model_name='distilbert-base-uncased', n_epochs=5, sample_size: int = None):
-    """
-    Main function to train the sentiment analysis model on the Amazon Reviews dataset.
-
-    Args:
-        data_path (str): Path to the processed data file.
-        model_name (str): Name of the transformer model to use.
-        n_epochs (int): Maximum number of epochs for training.
-        sample_size (int, optional): The number of reviews to use for training.
-            If None, the full dataset is used.
-    """
-    # --- 1. Hyperparameters ---
-    BATCH_SIZE = 64
-    MAX_TOKEN_LEN = 256
-    LEARNING_RATE = 2e-5
-    N_CLASSES = 2  # Negative, Positive
-
-    # --- 2. Initialize DataModule ---
-    print("Initializing ReviewDataModule...")
-    review_datamodule = ReviewDataModule(
-        data_path=data_path,
-        batch_size=BATCH_SIZE,
-        max_token_len=MAX_TOKEN_LEN,
-        model_name=model_name,
-        sample_size=sample_size  # Pass the sample size to the datamodule
-    )
-    review_datamodule.setup()
-
-    n_training_steps = len(review_datamodule.train_dataloader()) * n_epochs
-    n_warmup_steps = int(n_training_steps * 0.1)
-
-    # --- 3. Initialize Model ---
-    print("Initializing SentimentClassifier model...")
-    model = SentimentClassifier(
-        model_name=model_name,
-        n_classes=N_CLASSES,
-        learning_rate=LEARNING_RATE,
-        n_warmup_steps=n_warmup_steps,
-        n_training_steps=n_training_steps
-    )
-
-    # --- 4. Configure Training Callbacks ---
-    checkpoint_callback = ModelCheckpoint(
-        dirpath="checkpoints",
-        filename="sentiment-binary-best-checkpoint",
-        save_top_k=1,
-        verbose=True,
-        monitor="val_loss",
-        mode="min"
-    )
-    logger = TensorBoardLogger("lightning_logs", name="sentiment-classifier-binary")
-    early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)
-
-    # --- 5. Initialize Trainer ---
-    print("Initializing PyTorch Lightning Trainer...")
-    trainer = pl.Trainer(
-        logger=logger,
-        callbacks=[checkpoint_callback, early_stopping_callback],
-        max_epochs=n_epochs,
-        accelerator='gpu' if torch.cuda.is_available() else 'cpu',
-        devices=1,
-    )
-
-    # --- 6. Start Training ---
-    print(f"Starting training with {model_name} for up to {n_epochs} epochs...")
-    trainer.fit(model, review_datamodule)
-
-    # --- 7. Evaluate on Test Set and Generate Confusion Matrix ---
-    print("\nTraining complete. Evaluating on the test set...")
-    trainer.test(model, datamodule=review_datamodule)
-
-    predictions = trainer.predict(model, datamodule=review_datamodule)
-    if predictions:
-        all_preds = torch.cat(predictions).cpu().numpy()
-        true_labels = review_datamodule.test_df.sentiment.to_numpy()
-        target_names = ['Negative', 'Positive']  # Updated labels
-
-        cm = confusion_matrix(true_labels, all_preds)
-        plt.figure(figsize=(8, 6))
-        sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu',
-                    xticklabels=target_names, yticklabels=target_names)
-        plt.title('Confusion Matrix for Sentiment Analysis')
-        plt.xlabel('Predicted Label')
-        plt.ylabel('True Label')
-        plt.show()
-
-
-if __name__ == "__main__":
-    data_path = "data/reviews_processed.csv"
-    train_sentiment_model(data_path=data_path, sample_size=100000)
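The deleted trainer derived its linear-warmup schedule from the dataloader length (total steps = batches per epoch × epochs, warmup = 10% of that). A worked sketch of the arithmetic, with an illustrative batch count (100,000 samples at batch size 64 rounds up to 1563 batches):

```python
# Illustrative numbers, not measured from the actual dataset split
batches_per_epoch = 1563   # ceil(100_000 / 64)
n_epochs = 5

n_training_steps = batches_per_epoch * n_epochs   # total optimizer steps
n_warmup_steps = int(n_training_steps * 0.1)      # 10% linear warmup

print(n_training_steps, n_warmup_steps)
# → 7815 781
```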
scripts/train_naive_bayes.py
DELETED
@@ -1,118 +0,0 @@
-import pandas as pd
-import numpy as np
-from sklearn.model_selection import train_test_split, ParameterGrid, StratifiedKFold
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.naive_bayes import MultinomialNB
-from sklearn.pipeline import Pipeline
-from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
-import seaborn as sns
-import matplotlib.pyplot as plt
-from tqdm.notebook import tqdm
-import os
-
-def train_baseline_sentiment_model(data_path='data/reviews_processed.csv', grid_search=True, nb__alpha=0.1, tfidf__max_df=0.75, tfidf__ngram_range=(1, 2), sample_size: int = 50000):
-    """
-    Trains and evaluates a Multinomial Naive Bayes model for sentiment analysis.
-    Can optionally perform a grid search.
-
-    Args:
-        data_path (str): Path to the processed reviews CSV file.
-        grid_search (bool): If True, performs a grid search.
-        nb__alpha (float): Alpha for MultinomialNB.
-        tfidf__max_df (float): max_df for TfidfVectorizer.
-        tfidf__ngram_range (tuple): ngram_range for TfidfVectorizer.
-        sample_size (int, optional): Number of reviews to use. If None, uses all.
-    """
-    # --- 1. Load Data ---
-    print(f"Loading data from '{data_path}'...")
-    if not os.path.exists(data_path):
-        print(f"\nERROR: '{data_path}' not found. Please run the EDA script first!")
-        return
-
-    df = pd.read_csv(data_path)
-    df.dropna(inplace=True)
-
-    # --- 2. Sample Data ---
-    if sample_size:
-        print(f"Using a sample of {sample_size} reviews for training the baseline model.")
-        df = df.sample(n=sample_size, random_state=42)
-
-    # --- 3. Train-Test Split ---
-    print("Splitting data into training and testing sets...")
-    X_train, X_test, y_train, y_test = train_test_split(
-        df['full_text'],
-        df['sentiment'],
-        test_size=0.2,
-        random_state=42,
-        stratify=df['sentiment']
-    )
-
-    # --- 4. Create a Pipeline ---
-    pipeline = Pipeline([
-        ('tfidf', TfidfVectorizer(stop_words='english')),
-        ('nb', MultinomialNB()),
-    ])
-
-    best_params = None
-
-    if grid_search:
-        # --- 5a. Perform Grid Search ---
-        print("Performing Grid Search to find the best hyperparameters...")
-        parameters = {
-            'tfidf__ngram_range': [(1, 1), (1, 2)],
-            'tfidf__max_df': [0.5, 0.75, 1.0],
-            'nb__alpha': [0.1, 0.5, 1.0],
-        }
-        param_grid = list(ParameterGrid(parameters))
-        best_score = -1
-
-        for params in tqdm(param_grid, desc="Grid Search Progress"):
-            pipeline.set_params(**params)
-            pipeline.fit(X_train, y_train)
-            score = pipeline.score(X_test, y_test)
-            if score > best_score:
-                best_score = score
-                best_params = params
-
-        print(f"\nBest score on test set: {best_score:.4f}")
-        print("Best parameters found:")
-        print(best_params)
-
-    else:
-        # --- 5b. Use provided hyperparameters ---
-        print("Skipping grid search and using provided hyperparameters...")
-        best_params = {
-            'nb__alpha': nb__alpha,
-            'tfidf__max_df': tfidf__max_df,
-            'tfidf__ngram_range': tfidf__ngram_range
-        }
-
-    # --- 6. Train the Final Model ---
-    print("\nTraining final model...")
-    best_model = pipeline.set_params(**best_params)
-    best_model.fit(X_train, y_train)
-    print("Model training complete.")
-
-    # --- 7. Evaluate the Best Model ---
-    print("\n--- Model Evaluation ---")
-    y_pred = best_model.predict(X_test)
-
-    accuracy = accuracy_score(y_test, y_pred)
-    target_names = ['Negative', 'Positive']
-
-    print(f"Accuracy: {accuracy:.4f}")
-    print("\nClassification Report:")
-    print(classification_report(y_test, y_pred, target_names=target_names))
-
-    print("Confusion Matrix:")
-    cm = confusion_matrix(y_test, y_pred)
-    plt.figure(figsize=(8, 6))
-    sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
-                xticklabels=target_names, yticklabels=target_names)
-    plt.title('Confusion Matrix for Naive Bayes on Amazon Reviews')
-    plt.xlabel('Predicted Label')
-    plt.ylabel('True Label')
-    plt.show()
-
-if __name__ == "__main__":
-    train_baseline_sentiment_model(sample_size=150000, grid_search=False)
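The deleted baseline's manual grid search enumerates every combination of the three hyperparameter lists. The expansion that sklearn's `ParameterGrid` performs can be reproduced with `itertools.product` alone, which makes the search size explicit:

```python
from itertools import product

# Same grid as the deleted train_naive_bayes.py script
parameters = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.5, 0.75, 1.0],
    'nb__alpha': [0.1, 0.5, 1.0],
}
keys = list(parameters)
grid = [dict(zip(keys, combo)) for combo in product(*parameters.values())]

print(len(grid))  # 2 * 3 * 3 = 18 candidate settings
print(grid[0])
```

Each of the 18 dicts is one `pipeline.set_params(**params)` call in the loop, so the search fits and scores the TF-IDF + Naive Bayes pipeline 18 times.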