Anshrathore01 committed
Commit 0116d50 · 1 Parent(s): 9f0f097

Implement core pipelines and web UI
.gitignore CHANGED
@@ -1,5 +1,6 @@
 
 venv/
+.venv/
 .env/
 
 
@@ -18,6 +19,21 @@ artifacts/*
 !artifacts/summaries/
 !artifacts/models/
 
+# Keep directory structure but ignore generated assets
+artifacts/raw_data/*
+!artifacts/raw_data/.gitkeep
+!artifacts/raw_data/README.md
+artifacts/cleaned_data/*
+!artifacts/cleaned_data/.gitkeep
+artifacts/embeddings/*
+!artifacts/embeddings/.gitkeep
+artifacts/clustering/*
+!artifacts/clustering/.gitkeep
+artifacts/summaries/*
+!artifacts/summaries/.gitkeep
+artifacts/models/*
+!artifacts/models/.gitkeep
+
 # Local datasets too large for GitHub
 artifacts/raw_data/Electronics_5\ 2.json
README.md CHANGED
@@ -1,18 +1,58 @@
 # Opinion-Summarizer-NLP
 
-An end-to-end project to summarize opinions from Amazon Electronics reviews:
-- sample 50k reviews from Kaggle dataset
-- clean & preprocess text
-- generate embeddings (SentenceTransformer)
-- cluster reviews (KMeans / HDBSCAN)
-- create cluster-level summaries (T5 / Pegasus)
-- build FAISS vector index for retrieval
-- Streamlit app for exploration and search
-
-## Setup
-
-1. Create a virtual environment and install dependencies:
-```bash
-python -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
+An end-to-end workflow that turns raw Amazon electronics reviews into compact opinion summaries and a lightweight semantic search experience.
+
+## Project layout
+
+```
+├── src/components        # modular data/ML building blocks
+├── src/pipelines         # executable steps (load→embed→cluster→summarise)
+├── artifacts/            # generated assets (clean data, embeddings, etc.)
+├── templates/ + static/  # Flask UI
+└── notebooks/EDA.ipynb   # exploratory analysis walkthrough
+```
+
+## Getting started
+
+1. **Install dependencies**
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   pip install -r requirements.txt
+   ```
+2. **Place the sampled dataset** at `artifacts/raw_data/electronics_sample_50k.json`. This should be a JSONL file where each line is a review dict from the Amazon Electronics dataset.
+3. **Generate assets**
+   ```bash
+   python -m src.pipelines.build_embeddings_pipeline
+   python -m src.pipelines.clustering_pipeline
+   python -m src.pipelines.summarization_pipeline
+   ```
+   or simply run `python -m src.pipelines.full_run_pipeline` to execute all three.
+4. **Launch the app**
+   ```bash
+   flask --app app run --port 8000
+   ```
+
+## Pipelines
+
+| Step | Purpose | Output |
+| --- | --- | --- |
+| `build_embeddings_pipeline` | load → clean → embed reviews | `artifacts/cleaned_data/*.parquet`, `artifacts/embeddings/*.npy` |
+| `clustering_pipeline` | group reviews by semantic similarity | `artifacts/clustering/cluster_labels.csv` |
+| `summarization_pipeline` | produce abstractive summary per cluster | `artifacts/summaries/cluster_summaries.json` |
+
+## Web interface
+
+The Flask app exposes:
+- `/` overview page with the most recent cluster summaries
+- `/results` POST route to run semantic search over the indexed reviews
+
+Static styling lives in `static/styles.css`; HTML templates sit in `templates/`.
+
+## Notebook
+
+`notebooks/EDA.ipynb` reproduces the exploratory plots (length distributions, word clouds, rating histograms, etc.) over the sampled 50k reviews.
+
+## Configuration
+
+Tune paths and hyper-parameters inside `src/config/config.yaml`, `src/config/model_config.json`, and `src/config/cluster_config.json`.
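
> Step 2 of Getting started expects JSON-lines input. A minimal sketch of the assumed record shape — `reviewText` and `overall` are the field names the pipelines read elsewhere in this commit; the sample values are illustrative:

```python
import json

# One review per line of the file; only `reviewText` is mandatory for
# the pipelines, `overall` is used by the EDA plots (sample values are made up).
line = json.dumps({"reviewText": "Great battery life.", "overall": 5})
record = json.loads(line)
print(record["reviewText"])  # Great battery life.
```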
app.py CHANGED
@@ -0,0 +1,67 @@
+"""Flask entrypoint for the Opinion Summarizer demo."""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from functools import lru_cache
+from pathlib import Path
+from typing import List
+
+import numpy as np
+import pandas as pd
+import yaml
+from flask import Flask, redirect, render_template, request, url_for
+
+from src.components.query_engine import QueryEngine
+
+app = Flask(__name__)
+CONFIG_PATH = Path("src/config/config.yaml")
+
+
+@dataclass
+class QueryResult:
+    text: str
+    score: float
+
+
+@lru_cache(maxsize=1)
+def load_config():
+    return yaml.safe_load(CONFIG_PATH.read_text())
+
+
+@lru_cache(maxsize=1)
+def load_query_engine() -> QueryEngine:
+    config = load_config()
+    data_cfg = config["data"]
+    embeddings = np.load(data_cfg["embeddings_path"])
+    df = pd.read_parquet(data_cfg["cleaned_path"])
+    return QueryEngine(embeddings=embeddings, documents=df["reviewText"].tolist())
+
+
+def load_cluster_summaries():
+    config = load_config()
+    summary_path = Path(config["data"]["summaries_path"])
+    if summary_path.exists():
+        return json.loads(summary_path.read_text())
+    return []
+
+
+@app.route("/")
+def home():
+    return render_template("index.html", summaries=load_cluster_summaries())
+
+
+@app.route("/results", methods=["POST"])
+def results():
+    query = request.form.get("query", "").strip()
+    if not query:
+        return redirect(url_for("home"))
+    engine = load_query_engine()
+    matches = engine.search(query)
+    results = [QueryResult(text=doc, score=score) for doc, score in matches]
+    return render_template("results.html", query=query, results=results)
+
+
+if __name__ == "__main__":
+    app.run(debug=True, port=8000)
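
> `load_config` and `load_query_engine` above are wrapped in `@lru_cache(maxsize=1)`, so the YAML parse and the model/embedding load each happen once per process. A stdlib-only sketch of that pattern (the counter is just for illustration):

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=1)
def load_once() -> str:
    # Stands in for reading config.yaml or building the QueryEngine.
    calls["n"] += 1
    return "config"

load_once()
load_once()
print(calls["n"])  # 1 — the expensive load ran a single time
```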
artifacts/cleaned_data/.gitkeep ADDED
File without changes
artifacts/clustering/.gitkeep ADDED
File without changes
artifacts/embeddings/.gitkeep ADDED
File without changes
artifacts/models/.gitkeep ADDED
File without changes
artifacts/raw_data/README.md ADDED
@@ -0,0 +1,5 @@
+# Raw data placeholder
+
+Drop your sampled Amazon Electronics JSONL file in this folder.
+The pipelines expect a file named `electronics_sample_50k.json`.
+These files are ignored by git so you can keep large datasets locally.
artifacts/summaries/.gitkeep ADDED
File without changes
notebooks/EDA.ipynb CHANGED
@@ -20,10 +20,22 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "id": "8faa2d0f",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "ModuleNotFoundError",
+     "evalue": "No module named 'seaborn'",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
+      "Cell \u001b[0;32mIn[1], line 4\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mpandas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mpd\u001b[39;00m\n\u001b[1;32m 3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[0;32m----> 4\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mseaborn\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01msns\u001b[39;00m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mwordcloud\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m WordCloud\n\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mcollections\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Counter\n",
+      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'seaborn'"
+     ]
+    }
+   ],
    "source": [
     "import json\n",
     "import pandas as pd\n",
@@ -285,7 +297,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9"
+   "version": "3.9.25"
   }
  },
 "nbformat": 4,
src/components/clustering_engine.py CHANGED
@@ -0,0 +1,30 @@
+"""Clustering helpers for grouping similar reviews."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Optional
+
+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.decomposition import PCA
+
+
+@dataclass
+class ClusteringEngine:
+    n_clusters: int = 20
+    random_state: int = 42
+    use_pca: bool = True
+    pca_components: Optional[int] = 50
+
+    def fit_predict(self, embeddings: np.ndarray) -> np.ndarray:
+        matrix = embeddings
+        if self.use_pca and self.pca_components and matrix.shape[1] > self.pca_components:
+            reducer = PCA(n_components=self.pca_components, random_state=self.random_state)
+            matrix = reducer.fit_transform(matrix)
+        model = KMeans(n_clusters=self.n_clusters, random_state=self.random_state, n_init="auto")
+        labels = model.fit_predict(matrix)
+        return labels
+
+
+__all__ = ["ClusteringEngine"]
src/components/data_cleaning.py CHANGED
@@ -0,0 +1,45 @@
+"""Text cleaning routines for the review corpus."""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+from typing import Iterable, List
+
+import pandas as pd
+
+HTML_TAG_RE = re.compile(r"<[^>]+>")
+NON_ALPHA_RE = re.compile(r"[^a-zA-Z0-9\s]")
+MULTISPACE_RE = re.compile(r"\s+")
+
+
+@dataclass
+class ReviewCleaner:
+    lowercase: bool = True
+
+    def clean(self, text: str) -> str:
+        if not isinstance(text, str):
+            text = ""
+        if self.lowercase:
+            text = text.lower()
+        text = HTML_TAG_RE.sub(" ", text)
+        text = NON_ALPHA_RE.sub(" ", text)
+        text = MULTISPACE_RE.sub(" ", text)
+        return text.strip()
+
+    def clean_series(self, series: pd.Series) -> pd.Series:
+        return series.fillna("").map(self.clean)
+
+    def remove_short_reviews(self, df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
+        mask = df["reviewText"].str.len() >= min_chars
+        return df.loc[mask].copy()
+
+    def __call__(self, df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
+        df = df.copy()
+        df["clean_text"] = self.clean_series(df["reviewText"])
+        df = self.remove_short_reviews(df, min_chars=min_chars)
+        df = df.drop_duplicates(subset=["clean_text"])
+        return df.reset_index(drop=True)
+
+
+__all__ = ["ReviewCleaner"]
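
> `ReviewCleaner.clean` applies its three regexes in order: lowercase, strip HTML tags, drop non-alphanumerics, collapse whitespace. A stdlib-only walkthrough of the same steps on a made-up review string:

```python
import re

# Same patterns as in data_cleaning.py above.
HTML_TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-zA-Z0-9\s]")
MULTISPACE_RE = re.compile(r"\s+")

text = "<br>Great sound!!  5/5 :)".lower()
text = HTML_TAG_RE.sub(" ", text)      # remove markup
text = NON_ALPHA_RE.sub(" ", text)     # punctuation and emoticons -> spaces
text = MULTISPACE_RE.sub(" ", text).strip()
print(text)  # great sound 5 5
```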
src/components/data_loader.py CHANGED
@@ -0,0 +1,45 @@
+"""Utilities for loading raw Amazon electronics review data."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable, Optional
+
+import pandas as pd
+
+from src.utils.exception import CustomException
+
+
+@dataclass
+class ReviewDatasetLoader:
+    """Load JSON-lines review dumps with optional sampling."""
+
+    data_path: Path
+    sample_size: Optional[int] = None
+    random_state: int = 42
+
+    def _read_jsonl(self) -> Iterable[dict]:
+        if not self.data_path.exists():
+            raise CustomException(f"Dataset not found at {self.data_path}")
+        import json
+
+        with self.data_path.open("r", encoding="utf-8") as handle:
+            for line in handle:
+                line = line.strip()
+                if line:
+                    yield json.loads(line)
+
+    def load(self) -> pd.DataFrame:
+        records = list(self._read_jsonl())
+        if not records:
+            raise CustomException("Dataset file is empty")
+        df = pd.DataFrame(records)
+        df = df.dropna(subset=["reviewText"]).reset_index(drop=True)
+        if self.sample_size and len(df) > self.sample_size:
+            df = df.sample(self.sample_size, random_state=self.random_state)
+        df["reviewText"] = df["reviewText"].astype(str)
+        return df
+
+
+__all__ = ["ReviewDatasetLoader"]
src/components/embedding_generator.py CHANGED
@@ -0,0 +1,40 @@
+"""Sentence embedding generation utilities."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable, List
+
+import numpy as np
+from sentence_transformers import SentenceTransformer
+from tqdm.auto import tqdm
+
+
+@dataclass
+class EmbeddingGenerator:
+    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
+    batch_size: int = 64
+    normalize: bool = True
+
+    def __post_init__(self) -> None:
+        self.model = SentenceTransformer(self.model_name)
+
+    def encode(self, texts: Iterable[str]) -> np.ndarray:
+        embeddings: List[np.ndarray] = []
+        batch: List[str] = []
+        for text in texts:
+            batch.append(text)
+            if len(batch) == self.batch_size:
+                embeddings.append(self.model.encode(batch, normalize_embeddings=self.normalize))
+                batch = []
+        if batch:
+            embeddings.append(self.model.encode(batch, normalize_embeddings=self.normalize))
+        return np.vstack(embeddings)
+
+    def save(self, embeddings: np.ndarray, path: Path) -> None:
+        path.parent.mkdir(parents=True, exist_ok=True)
+        np.save(path, embeddings)
+
+
+__all__ = ["EmbeddingGenerator"]
src/components/query_engine.py CHANGED
@@ -0,0 +1,37 @@
+"""Simple semantic search over review embeddings."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import List, Sequence, Tuple
+
+import numpy as np
+from sklearn.neighbors import NearestNeighbors
+from sentence_transformers import SentenceTransformer
+
+
+@dataclass
+class QueryEngine:
+    embeddings: np.ndarray
+    documents: Sequence[str]
+    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
+    top_k: int = 5
+
+    def __post_init__(self) -> None:
+        if len(self.documents) != len(self.embeddings):
+            raise ValueError("Embeddings and documents must be aligned")
+        self.model = SentenceTransformer(self.model_name)
+        self.index = NearestNeighbors(metric="cosine")
+        self.index.fit(self.embeddings)
+
+    def search(self, query: str) -> List[Tuple[str, float]]:
+        query_emb = self.model.encode([query])
+        distances, indices = self.index.kneighbors(query_emb, n_neighbors=self.top_k)
+        results = []
+        for dist, idx in zip(distances[0], indices[0]):
+            similarity = 1 - dist
+            results.append((self.documents[idx], float(similarity)))
+        return results
+
+
+__all__ = ["QueryEngine"]
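
> `QueryEngine.search` reports `1 - cosine_distance` as the score, i.e. plain cosine similarity between the query embedding and each stored review embedding. The lookup at its core can be sketched without the model or index dependencies (toy 2-d vectors, made-up document names):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|) — the quantity NearestNeighbors(metric="cosine")
    # turns into a distance via 1 - similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

docs = {"battery": [1.0, 0.0], "screen": [0.0, 1.0]}
query = [0.9, 0.1]
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # battery
```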
src/components/summarization_engine.py CHANGED
@@ -0,0 +1,45 @@
+"""Cluster-level abstractive summarisation helpers."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Iterable
+
+from transformers import pipeline
+
+
+@dataclass
+class SummarizationEngine:
+    model_name: str = "google/pegasus-xsum"
+    max_length: int = 128
+    min_length: int = 32
+    max_input_chars: int = 6000
+    max_reviews: int = 100
+
+    def __post_init__(self) -> None:
+        self._pipeline = pipeline(
+            "summarization",
+            model=self.model_name,
+            tokenizer=self.model_name,
+        )
+
+    def summarize(self, texts: Iterable[str]) -> str:
+        cleaned = [text.strip() for text in texts if text and text.strip()]
+        if not cleaned:
+            return ""
+        if self.max_reviews:
+            cleaned = cleaned[: self.max_reviews]
+        joined = " ".join(cleaned)
+        if len(joined) > self.max_input_chars:
+            joined = joined[: self.max_input_chars]
+        output = self._pipeline(
+            joined,
+            max_length=self.max_length,
+            min_length=self.min_length,
+            do_sample=False,
+            truncation=True,
+        )
+        return output[0]["summary_text"].strip()
+
+
+__all__ = ["SummarizationEngine"]
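
> Before calling the model, `summarize` drops empty strings, caps the review count at `max_reviews`, then caps the joined text at `max_input_chars`. The truncation logic in isolation, with small caps and made-up reviews:

```python
MAX_REVIEWS = 3        # stands in for SummarizationEngine.max_reviews
MAX_INPUT_CHARS = 20   # stands in for SummarizationEngine.max_input_chars

reviews = ["good", "", "  bad  ", "okay", "fine"]
cleaned = [t.strip() for t in reviews if t and t.strip()][:MAX_REVIEWS]
joined = " ".join(cleaned)[:MAX_INPUT_CHARS]
print(cleaned)  # ['good', 'bad', 'okay']
print(joined)   # good bad okay
```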
src/components/visualization.py CHANGED
@@ -0,0 +1,33 @@
+"""Utility plots for exploratory analysis."""
+
+from __future__ import annotations
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+import pandas as pd
+
+
+sns.set_style("whitegrid")
+
+
+def plot_rating_distribution(df: pd.DataFrame):
+    if "overall" not in df.columns:
+        raise ValueError("Column 'overall' not present")
+    plt.figure(figsize=(7, 4))
+    sns.countplot(x="overall", data=df, palette="viridis")
+    plt.title("Ratings distribution")
+    return plt.gca()
+
+
+def plot_cluster_sizes(labels):
+    series = pd.Series(labels)
+    counts = series.value_counts().sort_index()
+    plt.figure(figsize=(10, 4))
+    counts.plot(kind="bar", color="#0b7fab")
+    plt.title("Cluster sizes")
+    plt.xlabel("Cluster id")
+    plt.ylabel("# Reviews")
+    return plt.gca()
+
+
+__all__ = ["plot_rating_distribution", "plot_cluster_sizes"]
src/config/cluster_config.json CHANGED
@@ -0,0 +1,6 @@
+{
+  "n_clusters": 20,
+  "use_pca": true,
+  "pca_components": 50,
+  "random_state": 42
+}
src/config/config.yaml CHANGED
@@ -0,0 +1,14 @@
+data:
+  raw_path: artifacts/raw_data/electronics_sample_50k.json
+  cleaned_path: artifacts/cleaned_data/clean_reviews.parquet
+  embeddings_path: artifacts/embeddings/review_embeddings.npy
+  cluster_assignments_path: artifacts/clustering/cluster_labels.csv
+  summaries_path: artifacts/summaries/cluster_summaries.json
+models:
+  embedding_model: sentence-transformers/all-MiniLM-L6-v2
+  summarizer_model: google/pegasus-xsum
+clustering:
+  n_clusters: 20
+  pca_components: 50
+app:
+  results_cache: artifacts/summaries/cluster_summaries.json
src/config/model_config.json CHANGED
@@ -0,0 +1,6 @@
+{
+  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
+  "summarizer_model": "google/pegasus-xsum",
+  "max_summary_length": 128,
+  "min_summary_length": 32
+}
src/pipelines/build_embeddings_pipeline.py CHANGED
@@ -0,0 +1,42 @@
+"""CLI to build review embeddings from the raw dataset."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+import yaml
+
+from src.components.data_loader import ReviewDatasetLoader
+from src.components.data_cleaning import ReviewCleaner
+from src.components.embedding_generator import EmbeddingGenerator
+
+
+def run(config_path: str = "src/config/config.yaml") -> None:
+    config = yaml.safe_load(Path(config_path).read_text())
+    data_cfg = config["data"]
+    model_cfg = config["models"]
+
+    loader = ReviewDatasetLoader(Path(data_cfg["raw_path"]))
+    cleaner = ReviewCleaner()
+    df = cleaner(loader.load())
+
+    generator = EmbeddingGenerator(model_name=model_cfg["embedding_model"])
+    embeddings = generator.encode(df["clean_text"].tolist())
+    generator.save(embeddings, Path(data_cfg["embeddings_path"]))
+
+    df[["reviewText", "clean_text"]].to_parquet(data_cfg["cleaned_path"], index=False)
+    print(f"Saved {len(df)} cleaned reviews and embeddings -> {data_cfg['embeddings_path']}")
+
+
+def cli() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--config", default="src/config/config.yaml")
+    args = parser.parse_args()
+    run(args.config)
+
+
+if __name__ == "__main__":
+    cli()
src/pipelines/clustering_pipeline.py CHANGED
@@ -0,0 +1,42 @@
+"""Cluster review embeddings and persist cluster labels."""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import pandas as pd
+import numpy as np
+import yaml
+
+from src.components.clustering_engine import ClusteringEngine
+
+
+def run(config_path: str = "src/config/config.yaml") -> None:
+    config = yaml.safe_load(Path(config_path).read_text())
+    data_cfg = config["data"]
+    cluster_cfg = config["clustering"]
+
+    embeddings = np.load(data_cfg["embeddings_path"])
+    engine = ClusteringEngine(
+        n_clusters=cluster_cfg["n_clusters"],
+        use_pca=cluster_cfg.get("use_pca", True),
+        pca_components=cluster_cfg.get("pca_components"),
+    )
+    labels = engine.fit_predict(embeddings)
+
+    df = pd.read_parquet(data_cfg["cleaned_path"])
+    df["cluster_id"] = labels
+    df.to_csv(data_cfg["cluster_assignments_path"], index=False)
+    print(f"Clustered {len(df)} reviews into {labels.max()+1} clusters")
+
+
+def cli() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--config", default="src/config/config.yaml")
+    args = parser.parse_args()
+    run(args.config)
+
+
+if __name__ == "__main__":
+    cli()
src/pipelines/full_run_pipeline.py CHANGED
@@ -0,0 +1,19 @@
+"""Execute the full offline pipeline: load -> clean -> embed -> cluster -> summarize."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from src.pipelines.build_embeddings_pipeline import run as build_embeddings
+from src.pipelines.clustering_pipeline import run as cluster_reviews
+from src.pipelines.summarization_pipeline import run as summarize_clusters
+
+
+def run(config_path: str = "src/config/config.yaml") -> None:
+    build_embeddings(config_path)
+    cluster_reviews(config_path)
+    summarize_clusters(config_path)
+
+
+if __name__ == "__main__":
+    run()
src/pipelines/query_pipeline.py CHANGED
@@ -0,0 +1,38 @@
+"""Expose a CLI helper to run semantic search queries."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import yaml
+
+from src.components.query_engine import QueryEngine
+
+
+def run(query: str, config_path: str = "src/config/config.yaml") -> None:
+    config = yaml.safe_load(Path(config_path).read_text())
+    data_cfg = config["data"]
+    embeddings = np.load(data_cfg["embeddings_path"])
+    df = pd.read_parquet(data_cfg["cleaned_path"])
+
+    engine = QueryEngine(embeddings=embeddings, documents=df["reviewText"].tolist())
+    results = engine.search(query)
+
+    for rank, (text, score) in enumerate(results, start=1):
+        print(f"[{rank}] score={score:.3f}\n{text}\n")
+
+
+def cli() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("query", help="Natural language query to look up")
+    parser.add_argument("--config", default="src/config/config.yaml")
+    args = parser.parse_args()
+    run(args.query, args.config)
+
+
+if __name__ == "__main__":
+    cli()
src/pipelines/summarization_pipeline.py CHANGED
@@ -0,0 +1,47 @@
+"""Generate abstractive summaries for each review cluster."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import pandas as pd
+import yaml
+
+from src.components.summarization_engine import SummarizationEngine
+
+
+def run(config_path: str = "src/config/config.yaml") -> None:
+    config = yaml.safe_load(Path(config_path).read_text())
+    data_cfg = config["data"]
+    model_cfg = config["models"]
+
+    df = pd.read_csv(data_cfg["cluster_assignments_path"])
+    engine = SummarizationEngine(
+        model_name=model_cfg["summarizer_model"],
+        max_length=model_cfg.get("max_summary_length", 128),
+        min_length=model_cfg.get("min_summary_length", 32),
+    )
+
+    summaries = []
+    for cluster_id, group in df.groupby("cluster_id"):
+        summary = engine.summarize(group["reviewText"].tolist()[:200])
+        summaries.append({
+            "cluster_id": int(cluster_id),
+            "summary": summary,
+            "size": int(len(group)),
+        })
+    Path(data_cfg["summaries_path"]).write_text(json.dumps(summaries, indent=2))
+    print(f"Wrote {len(summaries)} cluster summaries -> {data_cfg['summaries_path']}")
+
+
+def cli() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--config", default="src/config/config.yaml")
+    args = parser.parse_args()
+    run(args.config)
+
+
+if __name__ == "__main__":
+    cli()
src/utils/__pycache__/exception.cpython-39.pyc DELETED
Binary file (510 Bytes)
 
src/utils/__pycache__/file_utils.cpython-39.pyc DELETED
Binary file (1.01 kB)
 
src/utils/__pycache__/logger.cpython-39.pyc DELETED
Binary file (581 Bytes)
 
src/utils/__pycache__/plot_utils.cpython-39.pyc DELETED
Binary file (680 Bytes)
 
src/utils/__pycache__/text_utils.cpython-39.pyc DELETED
Binary file (590 Bytes)
 
static/styles.css CHANGED
@@ -0,0 +1,59 @@
+body {
+  font-family: "Inter", "Segoe UI", sans-serif;
+  margin: 0;
+  padding: 0;
+  background: #f4f6fb;
+  color: #1f2933;
+}
+
+header {
+  background: linear-gradient(120deg, #0b7fab, #3aa17e);
+  color: white;
+  padding: 2rem;
+  text-align: center;
+}
+
+main {
+  max-width: 960px;
+  margin: 2rem auto;
+  background: white;
+  border-radius: 12px;
+  box-shadow: 0 20px 40px rgba(15, 23, 42, 0.1);
+  padding: 2rem 3rem;
+}
+
+form textarea {
+  width: 100%;
+  min-height: 140px;
+  border-radius: 8px;
+  border: 1px solid #d3dae6;
+  padding: 1rem;
+  font-size: 1rem;
+  resize: vertical;
+}
+
+button {
+  background: #0b7fab;
+  border: none;
+  color: white;
+  font-size: 1rem;
+  padding: 0.9rem 1.8rem;
+  border-radius: 999px;
+  cursor: pointer;
+  margin-top: 1rem;
+}
+
+button:hover {
+  background: #095c7d;
+}
+
+.summary-card {
+  border: 1px solid #e5e9f2;
+  border-radius: 10px;
+  padding: 1rem 1.5rem;
+  margin-bottom: 1rem;
+}
+
+.summary-card h3 {
+  margin-top: 0;
+}
templates/index.html CHANGED
@@ -0,0 +1,35 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8" />
+  <title>Opinion Summarizer</title>
+  <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" />
+</head>
+<body>
+  <header>
+    <h1>Opinion Summarizer</h1>
+    <p>Explore high-level themes from thousands of electronics reviews.</p>
+  </header>
+  <main>
+    <section>
+      <h2>Ask a question</h2>
+      <form action="{{ url_for('results') }}" method="post">
+        <label for="query">What would you like to know?</label>
+        <textarea id="query" name="query" placeholder="e.g. battery life of noise cancelling headphones" required></textarea>
+        <button type="submit">Search</button>
+      </form>
+    </section>
+    {% if summaries %}
+    <section>
+      <h2>Latest cluster summaries</h2>
+      {% for item in summaries %}
+      <div class="summary-card">
+        <h3>Cluster {{ item.cluster_id }} ({{ item.size }} reviews)</h3>
+        <p>{{ item.summary }}</p>
+      </div>
+      {% endfor %}
+    </section>
+    {% endif %}
+  </main>
+</body>
+</html>
templates/results.html CHANGED
@@ -0,0 +1,29 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8" />
+  <title>Search results • Opinion Summarizer</title>
+  <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" />
+</head>
+<body>
+  <header>
+    <h1>Search results</h1>
+    <p>Query: <strong>{{ query }}</strong></p>
+    <a href="{{ url_for('home') }}" style="color:white">← Back</a>
+  </header>
+  <main>
+    {% if results %}
+    <section>
+      {% for result in results %}
+      <div class="summary-card">
+        <h3>Score {{ '{:.2f}'.format(result.score * 100) }}%</h3>
+        <p>{{ result.text }}</p>
+      </div>
+      {% endfor %}
+    </section>
+    {% else %}
+    <p>No matching reviews found. Try a different question.</p>
+    {% endif %}
+  </main>
+</body>
+</html>