Commit 0116d50
Parent(s): 9f0f097
Implement core pipelines and web UI
- .gitignore +16 -0
- README.md +56 -16
- app.py +67 -0
- artifacts/cleaned_data/.gitkeep +0 -0
- artifacts/clustering/.gitkeep +0 -0
- artifacts/embeddings/.gitkeep +0 -0
- artifacts/models/.gitkeep +0 -0
- artifacts/raw_data/README.md +5 -0
- artifacts/summaries/.gitkeep +0 -0
- notebooks/EDA.ipynb +15 -3
- src/components/clustering_engine.py +30 -0
- src/components/data_cleaning.py +45 -0
- src/components/data_loader.py +45 -0
- src/components/embedding_generator.py +40 -0
- src/components/query_engine.py +37 -0
- src/components/summarization_engine.py +45 -0
- src/components/visualization.py +33 -0
- src/config/cluster_config.json +6 -0
- src/config/config.yaml +14 -0
- src/config/model_config.json +6 -0
- src/pipelines/build_embeddings_pipeline.py +42 -0
- src/pipelines/clustering_pipeline.py +42 -0
- src/pipelines/full_run_pipeline.py +19 -0
- src/pipelines/query_pipeline.py +38 -0
- src/pipelines/summarization_pipeline.py +47 -0
- src/utils/__pycache__/exception.cpython-39.pyc +0 -0
- src/utils/__pycache__/file_utils.cpython-39.pyc +0 -0
- src/utils/__pycache__/logger.cpython-39.pyc +0 -0
- src/utils/__pycache__/plot_utils.cpython-39.pyc +0 -0
- src/utils/__pycache__/text_utils.cpython-39.pyc +0 -0
- static/styles.css +59 -0
- templates/index.html +35 -0
- templates/results.html +29 -0
.gitignore
CHANGED

```diff
@@ -1,5 +1,6 @@
 
 venv/
+.venv/
 .env/
 
 
@@ -18,6 +19,21 @@ artifacts/*
 !artifacts/summaries/
 !artifacts/models/
 
+# Keep directory structure but ignore generated assets
+artifacts/raw_data/*
+!artifacts/raw_data/.gitkeep
+!artifacts/raw_data/README.md
+artifacts/cleaned_data/*
+!artifacts/cleaned_data/.gitkeep
+artifacts/embeddings/*
+!artifacts/embeddings/.gitkeep
+artifacts/clustering/*
+!artifacts/clustering/.gitkeep
+artifacts/summaries/*
+!artifacts/summaries/.gitkeep
+artifacts/models/*
+!artifacts/models/.gitkeep
+
 # Local datasets too large for GitHub
 artifacts/raw_data/Electronics_5\ 2.json
```

README.md
CHANGED

````diff
@@ -1,18 +1,58 @@
 # Opinion-Summarizer-NLP
 
-An end-to-end
-
-
-
-
-
-
-
-
-#
-
-
-
-
-
-
+An end-to-end workflow that turns raw Amazon electronics reviews into compact opinion summaries and a lightweight semantic search experience.
+
+## Project layout
+
+```
+├── src/components        # modular data/ML building blocks
+├── src/pipelines         # executable steps (load→embed→cluster→summarise)
+├── artifacts/            # generated assets (clean data, embeddings, etc.)
+├── templates/ + static/  # Flask UI
+└── notebooks/EDA.ipynb   # exploratory analysis walkthrough
+```
+
+## Getting started
+
+1. **Install dependencies**
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   pip install -r requirements.txt
+   ```
+2. **Place the sampled dataset** at `artifacts/raw_data/electronics_sample_50k.json`. This should be a JSONL file where each line is a review dict from the Amazon Electronics dataset.
+3. **Generate assets**
+   ```bash
+   python -m src.pipelines.build_embeddings_pipeline
+   python -m src.pipelines.clustering_pipeline
+   python -m src.pipelines.summarization_pipeline
+   ```
+   or simply run `python -m src.pipelines.full_run_pipeline` to execute all three.
+4. **Launch the app**
+   ```bash
+   flask --app app run --port 8000
+   ```
+
+## Pipelines
+
+| Step | Purpose | Output |
+| --- | --- | --- |
+| `build_embeddings_pipeline` | load → clean → embed reviews | `artifacts/cleaned_data/*.parquet`, `artifacts/embeddings/*.npy` |
+| `clustering_pipeline` | group reviews by semantic similarity | `artifacts/clustering/cluster_labels.csv` |
+| `summarization_pipeline` | produce abstractive summary per cluster | `artifacts/summaries/cluster_summaries.json` |
+
+## Web interface
+
+The Flask app exposes:
+- `/` overview page with the most recent cluster summaries
+- `/results` POST route to run semantic search over the indexed reviews
+
+Static styling lives in `static/styles.css`; HTML templates sit in `templates/`.
+
+## Notebook
+
+`notebooks/EDA.ipynb` reproduces the exploratory plots (length distributions, word clouds, rating histograms, etc.) over the sampled 50k reviews.
+
+## Configuration
+
+Tune paths and hyper-parameters inside `src/config/config.yaml`, `src/config/model_config.json`, and `src/config/cluster_config.json`.
````

app.py
CHANGED (new file, +67 lines)

```python
"""Flask entrypoint for the Opinion Summarizer demo."""

from __future__ import annotations

import json
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import yaml
from flask import Flask, redirect, render_template, request, url_for

from src.components.query_engine import QueryEngine

app = Flask(__name__)
CONFIG_PATH = Path("src/config/config.yaml")


@dataclass
class QueryResult:
    text: str
    score: float


@lru_cache(maxsize=1)
def load_config():
    return yaml.safe_load(CONFIG_PATH.read_text())


@lru_cache(maxsize=1)
def load_query_engine() -> QueryEngine:
    config = load_config()
    data_cfg = config["data"]
    embeddings = np.load(data_cfg["embeddings_path"])
    df = pd.read_parquet(data_cfg["cleaned_path"])
    return QueryEngine(embeddings=embeddings, documents=df["reviewText"].tolist())


def load_cluster_summaries():
    config = load_config()
    summary_path = Path(config["data"]["summaries_path"])
    if summary_path.exists():
        return json.loads(summary_path.read_text())
    return []


@app.route("/")
def home():
    return render_template("index.html", summaries=load_cluster_summaries())


@app.route("/results", methods=["POST"])
def results():
    query = request.form.get("query", "").strip()
    if not query:
        return redirect(url_for("home"))
    engine = load_query_engine()
    matches = engine.search(query)
    results = [QueryResult(text=doc, score=score) for doc, score in matches]
    return render_template("results.html", query=query, results=results)


if __name__ == "__main__":
    app.run(debug=True, port=8000)
```

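The two `@lru_cache(maxsize=1)` loaders in app.py act as lazy singletons: the config and the query engine are built once on first use and the cached object is reused on every later request. A minimal standalone sketch of that pattern (the `load_count` counter is illustrative, not part of app.py):

```python
from functools import lru_cache

load_count = 0  # tracks how many times the expensive loader body actually runs


@lru_cache(maxsize=1)
def load_config() -> dict:
    """Simulate an expensive one-time load (reading YAML, building an index)."""
    global load_count
    load_count += 1
    return {"data": {"embeddings_path": "artifacts/embeddings/review_embeddings.npy"}}


first = load_config()
second = load_config()
assert first is second   # the cached object itself is returned, not a copy
assert load_count == 1   # the body ran exactly once
```

One consequence of this design: because the same dict/engine object is shared across requests, callers must treat it as read-only.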
artifacts/cleaned_data/.gitkeep
ADDED
File without changes

artifacts/clustering/.gitkeep
ADDED
File without changes

artifacts/embeddings/.gitkeep
ADDED
File without changes

artifacts/models/.gitkeep
ADDED
File without changes

artifacts/raw_data/README.md
ADDED (+5 lines)

```markdown
# Raw data placeholder

Drop your sampled Amazon Electronics JSONL file in this folder.
The pipelines expect a file named `electronics_sample_50k.json`.
These files are ignored by git so you can keep large datasets locally.
```

artifacts/summaries/.gitkeep
ADDED
File without changes

notebooks/EDA.ipynb
CHANGED

```diff
@@ -20,10 +20,22 @@
    },
    {
     "cell_type": "code",
-    "execution_count": null,
+    "execution_count": 1,
     "id": "8faa2d0f",
     "metadata": {},
-    "outputs": [],
+    "outputs": [
+     {
+      "ename": "ModuleNotFoundError",
+      "evalue": "No module named 'seaborn'",
+      "output_type": "error",
+      "traceback": [
+       "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+       "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
+       "Cell \u001b[0;32mIn[1], line 4\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mpandas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mpd\u001b[39;00m\n\u001b[1;32m 3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[0;32m----> 4\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mseaborn\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01msns\u001b[39;00m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mwordcloud\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m WordCloud\n\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mcollections\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Counter\n",
+       "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'seaborn'"
+      ]
+     }
+    ],
     "source": [
      "import json\n",
      "import pandas as pd\n",
@@ -285,7 +297,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9"
+   "version": "3.9.25"
   }
  },
 "nbformat": 4,
```

src/components/clustering_engine.py
CHANGED (new file, +30 lines)

```python
"""Clustering helpers for grouping similar reviews."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


@dataclass
class ClusteringEngine:
    n_clusters: int = 20
    random_state: int = 42
    use_pca: bool = True
    pca_components: Optional[int] = 50

    def fit_predict(self, embeddings: np.ndarray) -> np.ndarray:
        matrix = embeddings
        if self.use_pca and self.pca_components and matrix.shape[1] > self.pca_components:
            reducer = PCA(n_components=self.pca_components, random_state=self.random_state)
            matrix = reducer.fit_transform(matrix)
        model = KMeans(n_clusters=self.n_clusters, random_state=self.random_state, n_init="auto")
        labels = model.fit_predict(matrix)
        return labels


__all__ = ["ClusteringEngine"]
```

src/components/data_cleaning.py
CHANGED (new file, +45 lines)

```python
"""Text cleaning routines for the review corpus."""

from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Iterable, List

import pandas as pd

HTML_TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-zA-Z0-9\s]")
MULTISPACE_RE = re.compile(r"\s+")


@dataclass
class ReviewCleaner:
    lowercase: bool = True

    def clean(self, text: str) -> str:
        if not isinstance(text, str):
            text = ""
        if self.lowercase:
            text = text.lower()
        text = HTML_TAG_RE.sub(" ", text)
        text = NON_ALPHA_RE.sub(" ", text)
        text = MULTISPACE_RE.sub(" ", text)
        return text.strip()

    def clean_series(self, series: pd.Series) -> pd.Series:
        return series.fillna("").map(self.clean)

    def remove_short_reviews(self, df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
        mask = df["reviewText"].str.len() >= min_chars
        return df.loc[mask].copy()

    def __call__(self, df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
        df = df.copy()
        df["clean_text"] = self.clean_series(df["reviewText"])
        df = self.remove_short_reviews(df, min_chars=min_chars)
        df = df.drop_duplicates(subset=["clean_text"])
        return df.reset_index(drop=True)


__all__ = ["ReviewCleaner"]
```

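The regex pipeline in `ReviewCleaner.clean` (strip HTML tags, drop non-alphanumerics, collapse whitespace) can be exercised standalone; the patterns below are copied from the component:

```python
import re

HTML_TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-zA-Z0-9\s]")
MULTISPACE_RE = re.compile(r"\s+")


def clean(text, lowercase=True):
    if not isinstance(text, str):
        text = ""
    if lowercase:
        text = text.lower()
    text = HTML_TAG_RE.sub(" ", text)    # strip HTML markup
    text = NON_ALPHA_RE.sub(" ", text)   # drop punctuation/emoji
    text = MULTISPACE_RE.sub(" ", text)  # collapse runs of whitespace
    return text.strip()


print(clean("<b>Great</b> sound!!!  5/5, would buy again..."))
# → "great sound 5 5 would buy again"
```

Note the order matters: tags are removed before punctuation, so `<b>` does not leave a stray `b` behind, and non-string input (e.g. NaN from pandas) is coerced to an empty string.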
src/components/data_loader.py
CHANGED (new file, +45 lines)

```python
"""Utilities for loading raw Amazon electronics review data."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Optional

import pandas as pd

from src.utils.exception import CustomException


@dataclass
class ReviewDatasetLoader:
    """Load JSON-lines review dumps with optional sampling."""

    data_path: Path
    sample_size: Optional[int] = None
    random_state: int = 42

    def _read_jsonl(self) -> Iterable[dict]:
        if not self.data_path.exists():
            raise CustomException(f"Dataset not found at {self.data_path}")
        import json

        with self.data_path.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)

    def load(self) -> pd.DataFrame:
        records = list(self._read_jsonl())
        if not records:
            raise CustomException("Dataset file is empty")
        df = pd.DataFrame(records)
        df = df.dropna(subset=["reviewText"]).reset_index(drop=True)
        if self.sample_size and len(df) > self.sample_size:
            df = df.sample(self.sample_size, random_state=self.random_state)
        df["reviewText"] = df["reviewText"].astype(str)
        return df


__all__ = ["ReviewDatasetLoader"]
```

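`_read_jsonl` streams one JSON object per non-blank line, which is the JSONL contract the README asks the dataset to follow. A minimal sketch of that contract, using an in-memory string in place of the review dump:

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line, mirroring ReviewDatasetLoader._read_jsonl."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


raw = '{"reviewText": "Works great", "overall": 5}\n\n{"reviewText": "Died in a week", "overall": 1}\n'
records = list(read_jsonl(io.StringIO(raw)))
assert len(records) == 2                       # the blank line is skipped
assert records[0]["reviewText"] == "Works great"
```

Because the generator is consumed with `list(...)` in `load()`, the whole file ends up in memory; the optional `sample_size` only trims the DataFrame afterwards.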
src/components/embedding_generator.py
CHANGED (new file, +40 lines)

```python
"""Sentence embedding generation utilities."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List

import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm


@dataclass
class EmbeddingGenerator:
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    batch_size: int = 64
    normalize: bool = True

    def __post_init__(self) -> None:
        self.model = SentenceTransformer(self.model_name)

    def encode(self, texts: Iterable[str]) -> np.ndarray:
        embeddings: List[np.ndarray] = []
        batch: List[str] = []
        for text in texts:
            batch.append(text)
            if len(batch) == self.batch_size:
                embeddings.append(self.model.encode(batch, normalize_embeddings=self.normalize))
                batch = []
        if batch:
            embeddings.append(self.model.encode(batch, normalize_embeddings=self.normalize))
        return np.vstack(embeddings)

    def save(self, embeddings: np.ndarray, path: Path) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        np.save(path, embeddings)


__all__ = ["EmbeddingGenerator"]
```

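`EmbeddingGenerator.encode` accumulates texts into fixed-size batches and flushes a final partial batch at the end. The same loop can be checked with a stub encoder (`fake_encode` stands in for `SentenceTransformer.encode` and just records batch sizes):

```python
def fake_encode(batch):
    """Stub for the model call: return the batch size instead of embeddings."""
    return len(batch)


def batched_encode(texts, batch_size=4):
    sizes, batch = [], []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:   # flush a full batch
            sizes.append(fake_encode(batch))
            batch = []
    if batch:                          # flush the trailing partial batch
        sizes.append(fake_encode(batch))
    return sizes


assert batched_encode([f"review {i}" for i in range(10)]) == [4, 4, 2]
```

Accepting an `Iterable[str]` means the loop also works on generators, at the cost of not knowing the total count up front (which may be why the imported `tqdm` is left unused here).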
src/components/query_engine.py
CHANGED (new file, +37 lines)

```python
"""Simple semantic search over review embeddings."""

from __future__ import annotations

from dataclasses import dataclass
from typing import List, Sequence, Tuple

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer


@dataclass
class QueryEngine:
    embeddings: np.ndarray
    documents: Sequence[str]
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    top_k: int = 5

    def __post_init__(self) -> None:
        if len(self.documents) != len(self.embeddings):
            raise ValueError("Embeddings and documents must be aligned")
        self.model = SentenceTransformer(self.model_name)
        self.index = NearestNeighbors(metric="cosine")
        self.index.fit(self.embeddings)

    def search(self, query: str) -> List[Tuple[str, float]]:
        query_emb = self.model.encode([query])
        distances, indices = self.index.kneighbors(query_emb, n_neighbors=self.top_k)
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            similarity = 1 - dist
            results.append((self.documents[idx], float(similarity)))
        return results


__all__ = ["QueryEngine"]
```

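`search` converts scikit-learn's cosine *distance* back into a similarity with `1 - dist`. The same ranking can be reproduced with plain NumPy: on unit-normalised vectors, cosine similarity is just a dot product (the toy vectors below are assumptions for illustration, not real review embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalise each row

query = docs[3] + 0.01 * rng.normal(size=8)          # near-duplicate of doc 3
query /= np.linalg.norm(query)

sims = docs @ query                                   # cosine similarity per doc
top_k = np.argsort(-sims)[:2]                         # best matches first
assert top_k[0] == 3                                  # the near-duplicate wins
```

This also explains why `EmbeddingGenerator` normalises embeddings by default: with unit vectors, cosine distance and Euclidean nearest-neighbour rankings agree.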
src/components/summarization_engine.py
CHANGED (new file, +45 lines)

```python
"""Cluster-level abstractive summarisation helpers."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable

from transformers import pipeline


@dataclass
class SummarizationEngine:
    model_name: str = "google/pegasus-xsum"
    max_length: int = 128
    min_length: int = 32
    max_input_chars: int = 6000
    max_reviews: int = 100

    def __post_init__(self) -> None:
        self._pipeline = pipeline(
            "summarization",
            model=self.model_name,
            tokenizer=self.model_name,
        )

    def summarize(self, texts: Iterable[str]) -> str:
        cleaned = [text.strip() for text in texts if text and text.strip()]
        if not cleaned:
            return ""
        if self.max_reviews:
            cleaned = cleaned[: self.max_reviews]
        joined = " ".join(cleaned)
        if len(joined) > self.max_input_chars:
            joined = joined[: self.max_input_chars]
        output = self._pipeline(
            joined,
            max_length=self.max_length,
            min_length=self.min_length,
            do_sample=False,
            truncation=True,
        )
        return output[0]["summary_text"].strip()


__all__ = ["SummarizationEngine"]
```

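Before calling the model, `summarize` drops blank reviews, caps the review count, joins the rest, and truncates to `max_input_chars`. That pre-processing can be verified without loading Pegasus (`prepare_input` is a hypothetical helper mirroring the method up to the pipeline call):

```python
def prepare_input(texts, max_reviews=100, max_input_chars=6000):
    """Mirror SummarizationEngine.summarize up to the model call."""
    cleaned = [t.strip() for t in texts if t and t.strip()]
    if not cleaned:
        return ""                      # nothing usable: skip the model entirely
    if max_reviews:
        cleaned = cleaned[:max_reviews]
    joined = " ".join(cleaned)
    if len(joined) > max_input_chars:  # hard character cap before tokenisation
        joined = joined[:max_input_chars]
    return joined


texts = ["  good battery  ", "", None, "x" * 50]
out = prepare_input(texts, max_reviews=2, max_input_chars=30)
assert out == "good battery " + "x" * 17   # capped at 2 reviews, then 30 chars
assert len(out) == 30
```

The character cap is a blunt safeguard; the `truncation=True` flag passed to the Hugging Face pipeline does the precise token-level truncation.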
src/components/visualization.py
CHANGED (new file, +33 lines)

```python
"""Utility plots for exploratory analysis."""

from __future__ import annotations

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


sns.set_style("whitegrid")


def plot_rating_distribution(df: pd.DataFrame):
    if "overall" not in df.columns:
        raise ValueError("Column 'overall' not present")
    plt.figure(figsize=(7, 4))
    sns.countplot(x="overall", data=df, palette="viridis")
    plt.title("Ratings distribution")
    return plt.gca()


def plot_cluster_sizes(labels):
    series = pd.Series(labels)
    counts = series.value_counts().sort_index()
    plt.figure(figsize=(10, 4))
    counts.plot(kind="bar", color="#0b7fab")
    plt.title("Cluster sizes")
    plt.xlabel("Cluster id")
    plt.ylabel("# Reviews")
    return plt.gca()


__all__ = ["plot_rating_distribution", "plot_cluster_sizes"]
```

src/config/cluster_config.json
CHANGED (new file, +6 lines)

```json
{
  "n_clusters": 20,
  "use_pca": true,
  "pca_components": 50,
  "random_state": 42
}
```

src/config/config.yaml
CHANGED (new file, +14 lines)

```yaml
data:
  raw_path: artifacts/raw_data/electronics_sample_50k.json
  cleaned_path: artifacts/cleaned_data/clean_reviews.parquet
  embeddings_path: artifacts/embeddings/review_embeddings.npy
  cluster_assignments_path: artifacts/clustering/cluster_labels.csv
  summaries_path: artifacts/summaries/cluster_summaries.json
models:
  embedding_model: sentence-transformers/all-MiniLM-L6-v2
  summarizer_model: google/pegasus-xsum
clustering:
  n_clusters: 20
  pca_components: 50
app:
  results_cache: artifacts/summaries/cluster_summaries.json
```

src/config/model_config.json
CHANGED (new file, +6 lines)

```json
{
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
  "summarizer_model": "google/pegasus-xsum",
  "max_summary_length": 128,
  "min_summary_length": 32
}
```

src/pipelines/build_embeddings_pipeline.py
CHANGED (new file, +42 lines)

```python
"""CLI to build review embeddings from the raw dataset."""

from __future__ import annotations

import argparse
import json
from pathlib import Path

import numpy as np
import yaml

from src.components.data_loader import ReviewDatasetLoader
from src.components.data_cleaning import ReviewCleaner
from src.components.embedding_generator import EmbeddingGenerator


def run(config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    model_cfg = config["models"]

    loader = ReviewDatasetLoader(Path(data_cfg["raw_path"]))
    cleaner = ReviewCleaner()
    df = cleaner(loader.load())

    generator = EmbeddingGenerator(model_name=model_cfg["embedding_model"])
    embeddings = generator.encode(df["clean_text"].tolist())
    generator.save(embeddings, Path(data_cfg["embeddings_path"]))

    df[["reviewText", "clean_text"]].to_parquet(data_cfg["cleaned_path"], index=False)
    print(f"Saved {len(df)} cleaned reviews and embeddings -> {data_cfg['embeddings_path']}")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.config)


if __name__ == "__main__":
    cli()
```

src/pipelines/clustering_pipeline.py
CHANGED (new file, +42 lines)

```python
"""Cluster review embeddings and persist cluster labels."""

from __future__ import annotations

import argparse
from pathlib import Path

import pandas as pd
import numpy as np
import yaml

from src.components.clustering_engine import ClusteringEngine


def run(config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    cluster_cfg = config["clustering"]

    embeddings = np.load(data_cfg["embeddings_path"])
    engine = ClusteringEngine(
        n_clusters=cluster_cfg["n_clusters"],
        use_pca=cluster_cfg.get("use_pca", True),
        pca_components=cluster_cfg.get("pca_components"),
    )
    labels = engine.fit_predict(embeddings)

    df = pd.read_parquet(data_cfg["cleaned_path"])
    df["cluster_id"] = labels
    df.to_csv(data_cfg["cluster_assignments_path"], index=False)
    print(f"Clustered {len(df)} reviews into {labels.max()+1} clusters")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.config)


if __name__ == "__main__":
    cli()
```

src/pipelines/full_run_pipeline.py
CHANGED (new file, +19 lines)

```python
"""Execute the full offline pipeline: load -> clean -> embed -> cluster -> summarize."""

from __future__ import annotations

from pathlib import Path

from src.pipelines.build_embeddings_pipeline import run as build_embeddings
from src.pipelines.clustering_pipeline import run as cluster_reviews
from src.pipelines.summarization_pipeline import run as summarize_clusters


def run(config_path: str = "src/config/config.yaml") -> None:
    build_embeddings(config_path)
    cluster_reviews(config_path)
    summarize_clusters(config_path)


if __name__ == "__main__":
    run()
```

src/pipelines/query_pipeline.py
CHANGED
@@ -0,0 +1,38 @@
"""Expose a CLI helper to run semantic search queries."""

from __future__ import annotations

import argparse
import json
from pathlib import Path

import numpy as np
import pandas as pd
import yaml

from src.components.query_engine import QueryEngine


def run(query: str, config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    embeddings = np.load(data_cfg["embeddings_path"])
    df = pd.read_parquet(data_cfg["cleaned_path"])

    engine = QueryEngine(embeddings=embeddings, documents=df["reviewText"].tolist())
    results = engine.search(query)

    for rank, (text, score) in enumerate(results, start=1):
        print(f"[{rank}] score={score:.3f}\n{text}\n")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("query", help="Natural language query to look up")
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.query, args.config)


if __name__ == "__main__":
    cli()
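query_pipeline.py treats `QueryEngine` (defined in src/components/query_engine.py, outside this hunk) as a black box that takes precomputed embeddings plus the raw documents and returns `(text, score)` pairs from `search()`. A minimal sketch of that interface, assuming cosine similarity over unit-normalised vectors; the `encoder` argument is a hypothetical stand-in for whatever sentence encoder the real component uses internally:

```python
import numpy as np


class QueryEngine:
    """Sketch of the interface query_pipeline.py relies on (assumed, not the real component)."""

    def __init__(self, embeddings, documents, encoder=None):
        # L2-normalise once so a plain dot product is a cosine similarity.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / np.clip(norms, 1e-12, None)
        self.documents = documents
        self.encoder = encoder  # callable: str -> 1-D embedding vector

    def search(self, query, top_k=5):
        q = self.encoder(query)
        q = q / max(np.linalg.norm(q), 1e-12)
        scores = self.embeddings @ q
        # Highest-scoring documents first.
        idx = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[i], float(scores[i])) for i in idx]
```

With a toy encoder that maps every query onto the second axis, `search(..., top_k=1)` returns the second document with a score of 1.0, which matches the `(text, score)` unpacking in the pipeline's print loop.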
src/pipelines/summarization_pipeline.py
CHANGED
@@ -0,0 +1,47 @@
"""Generate abstractive summaries for each review cluster."""

from __future__ import annotations

import argparse
import json
from pathlib import Path

import pandas as pd
import yaml

from src.components.summarization_engine import SummarizationEngine


def run(config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    model_cfg = config["models"]

    df = pd.read_csv(data_cfg["cluster_assignments_path"])
    engine = SummarizationEngine(
        model_name=model_cfg["summarizer_model"],
        max_length=model_cfg.get("max_summary_length", 128),
        min_length=model_cfg.get("min_summary_length", 32),
    )

    summaries = []
    for cluster_id, group in df.groupby("cluster_id"):
        summary = engine.summarize(group["reviewText"].tolist()[:200])
        summaries.append({
            "cluster_id": int(cluster_id),
            "summary": summary,
            "size": int(len(group)),
        })
    Path(data_cfg["summaries_path"]).write_text(json.dumps(summaries, indent=2))
    print(f"Wrote {len(summaries)} cluster summaries -> {data_cfg['summaries_path']}")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.config)


if __name__ == "__main__":
    cli()
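Both pipelines above resolve every path and model setting through src/config/config.yaml (added elsewhere in this commit). A sketch of only the keys these two files actually read; the concrete paths and the model name are assumptions chosen to match the artifacts/ layout, not the committed values:

```yaml
data:
  cleaned_path: artifacts/cleaned_data/reviews.parquet          # read by query_pipeline
  embeddings_path: artifacts/embeddings/review_embeddings.npy   # read by query_pipeline
  cluster_assignments_path: artifacts/clustering/assignments.csv  # read by summarization_pipeline
  summaries_path: artifacts/summaries/cluster_summaries.json      # written by summarization_pipeline
models:
  summarizer_model: facebook/bart-large-cnn   # hypothetical default
  max_summary_length: 128                     # optional; pipeline falls back to 128
  min_summary_length: 32                      # optional; pipeline falls back to 32
```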
src/utils/__pycache__/exception.cpython-39.pyc
DELETED
Binary file (510 Bytes)

src/utils/__pycache__/file_utils.cpython-39.pyc
DELETED
Binary file (1.01 kB)

src/utils/__pycache__/logger.cpython-39.pyc
DELETED
Binary file (581 Bytes)

src/utils/__pycache__/plot_utils.cpython-39.pyc
DELETED
Binary file (680 Bytes)

src/utils/__pycache__/text_utils.cpython-39.pyc
DELETED
Binary file (590 Bytes)
static/styles.css
CHANGED
@@ -0,0 +1,59 @@
body {
  font-family: "Inter", "Segoe UI", sans-serif;
  margin: 0;
  padding: 0;
  background: #f4f6fb;
  color: #1f2933;
}

header {
  background: linear-gradient(120deg, #0b7fab, #3aa17e);
  color: white;
  padding: 2rem;
  text-align: center;
}

main {
  max-width: 960px;
  margin: 2rem auto;
  background: white;
  border-radius: 12px;
  box-shadow: 0 20px 40px rgba(15, 23, 42, 0.1);
  padding: 2rem 3rem;
}

form textarea {
  width: 100%;
  min-height: 140px;
  border-radius: 8px;
  border: 1px solid #d3dae6;
  padding: 1rem;
  font-size: 1rem;
  resize: vertical;
}

button {
  background: #0b7fab;
  border: none;
  color: white;
  font-size: 1rem;
  padding: 0.9rem 1.8rem;
  border-radius: 999px;
  cursor: pointer;
  margin-top: 1rem;
}

button:hover {
  background: #095c7d;
}

.summary-card {
  border: 1px solid #e5e9f2;
  border-radius: 10px;
  padding: 1rem 1.5rem;
  margin-bottom: 1rem;
}

.summary-card h3 {
  margin-top: 0;
}
templates/index.html
CHANGED
@@ -0,0 +1,35 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Opinion Summarizer</title>
  <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" />
</head>
<body>
  <header>
    <h1>Opinion Summarizer</h1>
    <p>Explore high-level themes from thousands of electronics reviews.</p>
  </header>
  <main>
    <section>
      <h2>Ask a question</h2>
      <form action="{{ url_for('results') }}" method="post">
        <label for="query">What would you like to know?</label>
        <textarea id="query" name="query" placeholder="e.g. battery life of noise cancelling headphones" required></textarea>
        <button type="submit">Search</button>
      </form>
    </section>
    {% if summaries %}
    <section>
      <h2>Latest cluster summaries</h2>
      {% for item in summaries %}
      <div class="summary-card">
        <h3>Cluster {{ item.cluster_id }} ({{ item.size }} reviews)</h3>
        <p>{{ item.summary }}</p>
      </div>
      {% endfor %}
    </section>
    {% endif %}
  </main>
</body>
</html>
templates/results.html
CHANGED
@@ -0,0 +1,29 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Search results • Opinion Summarizer</title>
  <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" />
</head>
<body>
  <header>
    <h1>Search results</h1>
    <p>Query: <strong>{{ query }}</strong></p>
    <a href="{{ url_for('home') }}" style="color:white">← Back</a>
  </header>
  <main>
    {% if results %}
    <section>
      {% for result in results %}
      <div class="summary-card">
        <h3>Score {{ '{:.2f}'.format(result.score * 100) }}%</h3>
        <p>{{ result.text }}</p>
      </div>
      {% endfor %}
    </section>
    {% else %}
    <p>No matching reviews found. Try a different question.</p>
    {% endif %}
  </main>
</body>
</html>
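The two templates assume a Flask app (app.py in this commit, not shown in this hunk) exposing a `home` endpoint that renders the search form and a `results` endpoint that receives the POSTed `query` field. A self-contained sketch of that wiring, using `render_template_string` as a stand-in for the real templates so it runs without a templates/ directory; the real routes render index.html and results.html and pass in objects with `.text` and `.score` attributes:

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)


@app.route("/")
def home():
    # Stand-in for render_template("index.html", summaries=...); the real
    # index.html posts a textarea named "query" to url_for('results').
    return render_template_string(
        "<form action=\"{{ url_for('results') }}\" method=\"post\">"
        "<textarea name=\"query\"></textarea>"
        "<button type=\"submit\">Search</button></form>"
    )


@app.route("/results", methods=["POST"])
def results():
    query = request.form["query"]
    # The real handler runs the query pipeline and hands results.html a list
    # of scored matches; here we only echo the query back.
    return render_template_string(
        "Query: <strong>{{ query }}</strong>", query=query
    )
```

`url_for('results')` resolves against the `results` view function name, which is why results.html's "Back" link relies on the companion view being named `home`.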