Commit 0116d50
Parent(s): 9f0f097
Implement core pipelines and web UI
- .gitignore +16 -0
- README.md +56 -16
- app.py +67 -0
- artifacts/cleaned_data/.gitkeep +0 -0
- artifacts/clustering/.gitkeep +0 -0
- artifacts/embeddings/.gitkeep +0 -0
- artifacts/models/.gitkeep +0 -0
- artifacts/raw_data/README.md +5 -0
- artifacts/summaries/.gitkeep +0 -0
- notebooks/EDA.ipynb +15 -3
- src/components/clustering_engine.py +30 -0
- src/components/data_cleaning.py +45 -0
- src/components/data_loader.py +45 -0
- src/components/embedding_generator.py +40 -0
- src/components/query_engine.py +37 -0
- src/components/summarization_engine.py +45 -0
- src/components/visualization.py +33 -0
- src/config/cluster_config.json +6 -0
- src/config/config.yaml +14 -0
- src/config/model_config.json +6 -0
- src/pipelines/build_embeddings_pipeline.py +42 -0
- src/pipelines/clustering_pipeline.py +42 -0
- src/pipelines/full_run_pipeline.py +19 -0
- src/pipelines/query_pipeline.py +38 -0
- src/pipelines/summarization_pipeline.py +47 -0
- src/utils/__pycache__/exception.cpython-39.pyc +0 -0
- src/utils/__pycache__/file_utils.cpython-39.pyc +0 -0
- src/utils/__pycache__/logger.cpython-39.pyc +0 -0
- src/utils/__pycache__/plot_utils.cpython-39.pyc +0 -0
- src/utils/__pycache__/text_utils.cpython-39.pyc +0 -0
- static/styles.css +59 -0
- templates/index.html +35 -0
- templates/results.html +29 -0
.gitignore
CHANGED

```diff
@@ -1,5 +1,6 @@
 
 venv/
+.venv/
 .env/
 
 
@@ -18,6 +19,21 @@ artifacts/*
 !artifacts/summaries/
 !artifacts/models/
 
+# Keep directory structure but ignore generated assets
+artifacts/raw_data/*
+!artifacts/raw_data/.gitkeep
+!artifacts/raw_data/README.md
+artifacts/cleaned_data/*
+!artifacts/cleaned_data/.gitkeep
+artifacts/embeddings/*
+!artifacts/embeddings/.gitkeep
+artifacts/clustering/*
+!artifacts/clustering/.gitkeep
+artifacts/summaries/*
+!artifacts/summaries/.gitkeep
+artifacts/models/*
+!artifacts/models/.gitkeep
+
 # Local datasets too large for GitHub
 artifacts/raw_data/Electronics_5\ 2.json
```

README.md
CHANGED

````diff
@@ -1,18 +1,58 @@
 # Opinion-Summarizer-NLP
 
-An end-to-end
-
-
-
-
-
-
-
-
-#
-
-
-
-
-
-
+An end-to-end workflow that turns raw Amazon electronics reviews into compact opinion summaries and a lightweight semantic search experience.
+
+## Project layout
+
+```
+├── src/components        # modular data/ML building blocks
+├── src/pipelines         # executable steps (load→embed→cluster→summarise)
+├── artifacts/            # generated assets (clean data, embeddings, etc.)
+├── templates/ + static/  # Flask UI
+└── notebooks/EDA.ipynb   # exploratory analysis walkthrough
+```
+
+## Getting started
+
+1. **Install dependencies**
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   pip install -r requirements.txt
+   ```
+2. **Place the sampled dataset** at `artifacts/raw_data/electronics_sample_50k.json`. This should be a JSONL file where each line is a review dict from the Amazon Electronics dataset.
+3. **Generate assets**
+   ```bash
+   python -m src.pipelines.build_embeddings_pipeline
+   python -m src.pipelines.clustering_pipeline
+   python -m src.pipelines.summarization_pipeline
+   ```
+   or simply run `python -m src.pipelines.full_run_pipeline` to execute all three.
+4. **Launch the app**
+   ```bash
+   flask --app app run --port 8000
+   ```
+
+## Pipelines
+
+| Step | Purpose | Output |
+| --- | --- | --- |
+| `build_embeddings_pipeline` | load → clean → embed reviews | `artifacts/cleaned_data/*.parquet`, `artifacts/embeddings/*.npy` |
+| `clustering_pipeline` | group reviews by semantic similarity | `artifacts/clustering/cluster_labels.csv` |
+| `summarization_pipeline` | produce abstractive summary per cluster | `artifacts/summaries/cluster_summaries.json` |
+
+## Web interface
+
+The Flask app exposes:
+- `/` overview page with the most recent cluster summaries
+- `/results` POST route to run semantic search over the indexed reviews
+
+Static styling lives in `static/styles.css`; HTML templates sit in `templates/`.
+
+## Notebook
+
+`notebooks/EDA.ipynb` reproduces the exploratory plots (length distributions, word clouds, rating histograms, etc.) over the sampled 50k reviews.
+
+## Configuration
+
+Tune paths and hyper-parameters inside `src/config/config.yaml`, `src/config/model_config.json`, and `src/config/cluster_config.json`.
````

app.py
CHANGED (new file, +67 lines)

```python
"""Flask entrypoint for the Opinion Summarizer demo."""

from __future__ import annotations

import json
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import yaml
from flask import Flask, redirect, render_template, request, url_for

from src.components.query_engine import QueryEngine

app = Flask(__name__)
CONFIG_PATH = Path("src/config/config.yaml")


@dataclass
class QueryResult:
    text: str
    score: float


@lru_cache(maxsize=1)
def load_config():
    return yaml.safe_load(CONFIG_PATH.read_text())


@lru_cache(maxsize=1)
def load_query_engine() -> QueryEngine:
    config = load_config()
    data_cfg = config["data"]
    embeddings = np.load(data_cfg["embeddings_path"])
    df = pd.read_parquet(data_cfg["cleaned_path"])
    return QueryEngine(embeddings=embeddings, documents=df["reviewText"].tolist())


def load_cluster_summaries():
    config = load_config()
    summary_path = Path(config["data"]["summaries_path"])
    if summary_path.exists():
        return json.loads(summary_path.read_text())
    return []


@app.route("/")
def home():
    return render_template("index.html", summaries=load_cluster_summaries())


@app.route("/results", methods=["POST"])
def results():
    query = request.form.get("query", "").strip()
    if not query:
        return redirect(url_for("home"))
    engine = load_query_engine()
    matches = engine.search(query)
    results = [QueryResult(text=doc, score=score) for doc, score in matches]
    return render_template("results.html", query=query, results=results)


if __name__ == "__main__":
    app.run(debug=True, port=8000)
```

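The two `@lru_cache(maxsize=1)` loaders in app.py act as lazy singletons: the config and the query engine are built once on first use and the cached object is reused on every later request. A minimal standalone sketch of that pattern (the `load_count` counter is illustrative, not part of app.py):

```python
from functools import lru_cache

load_count = 0  # tracks how many times the expensive loader body actually runs


@lru_cache(maxsize=1)
def load_config() -> dict:
    """Simulate an expensive one-time load (reading YAML, building an index)."""
    global load_count
    load_count += 1
    return {"data": {"embeddings_path": "artifacts/embeddings/review_embeddings.npy"}}


first = load_config()
second = load_config()
assert first is second   # the cached object itself is returned, not a copy
assert load_count == 1   # the body ran exactly once
```

One consequence of this design: because the same dict/engine object is shared across requests, callers must treat it as read-only.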
artifacts/cleaned_data/.gitkeep
ADDED
File without changes

artifacts/clustering/.gitkeep
ADDED
File without changes

artifacts/embeddings/.gitkeep
ADDED
File without changes

artifacts/models/.gitkeep
ADDED
File without changes

artifacts/raw_data/README.md
ADDED (+5 lines)

```markdown
# Raw data placeholder

Drop your sampled Amazon Electronics JSONL file in this folder.
The pipelines expect a file named `electronics_sample_50k.json`.
These files are ignored by git so you can keep large datasets locally.
```

artifacts/summaries/.gitkeep
ADDED
File without changes

notebooks/EDA.ipynb
CHANGED

```diff
@@ -20,10 +20,22 @@
    },
    {
     "cell_type": "code",
-    "execution_count": null,
+    "execution_count": 1,
     "id": "8faa2d0f",
     "metadata": {},
-    "outputs": [],
+    "outputs": [
+     {
+      "ename": "ModuleNotFoundError",
+      "evalue": "No module named 'seaborn'",
+      "output_type": "error",
+      "traceback": [
+       "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+       "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
+       "Cell \u001b[0;32mIn[1], line 4\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mpandas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mpd\u001b[39;00m\n\u001b[1;32m 3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[0;32m----> 4\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mseaborn\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01msns\u001b[39;00m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mwordcloud\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m WordCloud\n\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mcollections\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Counter\n",
+       "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'seaborn'"
+      ]
+     }
+    ],
     "source": [
      "import json\n",
      "import pandas as pd\n",
@@ -285,7 +297,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9"
+   "version": "3.9.25"
   }
  },
 "nbformat": 4,
```

src/components/clustering_engine.py
CHANGED (new file, +30 lines)

```python
"""Clustering helpers for grouping similar reviews."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


@dataclass
class ClusteringEngine:
    n_clusters: int = 20
    random_state: int = 42
    use_pca: bool = True
    pca_components: Optional[int] = 50

    def fit_predict(self, embeddings: np.ndarray) -> np.ndarray:
        matrix = embeddings
        if self.use_pca and self.pca_components and matrix.shape[1] > self.pca_components:
            reducer = PCA(n_components=self.pca_components, random_state=self.random_state)
            matrix = reducer.fit_transform(matrix)
        model = KMeans(n_clusters=self.n_clusters, random_state=self.random_state, n_init="auto")
        labels = model.fit_predict(matrix)
        return labels


__all__ = ["ClusteringEngine"]
```

src/components/data_cleaning.py
CHANGED (new file, +45 lines)

```python
"""Text cleaning routines for the review corpus."""

from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Iterable, List

import pandas as pd

HTML_TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-zA-Z0-9\s]")
MULTISPACE_RE = re.compile(r"\s+")


@dataclass
class ReviewCleaner:
    lowercase: bool = True

    def clean(self, text: str) -> str:
        if not isinstance(text, str):
            text = ""
        if self.lowercase:
            text = text.lower()
        text = HTML_TAG_RE.sub(" ", text)
        text = NON_ALPHA_RE.sub(" ", text)
        text = MULTISPACE_RE.sub(" ", text)
        return text.strip()

    def clean_series(self, series: pd.Series) -> pd.Series:
        return series.fillna("").map(self.clean)

    def remove_short_reviews(self, df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
        mask = df["reviewText"].str.len() >= min_chars
        return df.loc[mask].copy()

    def __call__(self, df: pd.DataFrame, min_chars: int = 20) -> pd.DataFrame:
        df = df.copy()
        df["clean_text"] = self.clean_series(df["reviewText"])
        df = self.remove_short_reviews(df, min_chars=min_chars)
        df = df.drop_duplicates(subset=["clean_text"])
        return df.reset_index(drop=True)


__all__ = ["ReviewCleaner"]
```

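The regex pipeline in `ReviewCleaner.clean` (strip HTML tags, drop non-alphanumerics, collapse whitespace) can be exercised standalone; the patterns below are copied from the component:

```python
import re

HTML_TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-zA-Z0-9\s]")
MULTISPACE_RE = re.compile(r"\s+")


def clean(text, lowercase=True):
    if not isinstance(text, str):
        text = ""
    if lowercase:
        text = text.lower()
    text = HTML_TAG_RE.sub(" ", text)    # strip HTML markup
    text = NON_ALPHA_RE.sub(" ", text)   # drop punctuation/emoji
    text = MULTISPACE_RE.sub(" ", text)  # collapse runs of whitespace
    return text.strip()


print(clean("<b>Great</b> sound!!!  5/5, would buy again..."))
# → "great sound 5 5 would buy again"
```

Note the order matters: tags are removed before punctuation, so `<b>` does not leave a stray `b` behind, and non-string input (e.g. NaN from pandas) is coerced to an empty string.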
src/components/data_loader.py
CHANGED (new file, +45 lines)

```python
"""Utilities for loading raw Amazon electronics review data."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Optional

import pandas as pd

from src.utils.exception import CustomException


@dataclass
class ReviewDatasetLoader:
    """Load JSON-lines review dumps with optional sampling."""

    data_path: Path
    sample_size: Optional[int] = None
    random_state: int = 42

    def _read_jsonl(self) -> Iterable[dict]:
        if not self.data_path.exists():
            raise CustomException(f"Dataset not found at {self.data_path}")
        import json

        with self.data_path.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)

    def load(self) -> pd.DataFrame:
        records = list(self._read_jsonl())
        if not records:
            raise CustomException("Dataset file is empty")
        df = pd.DataFrame(records)
        df = df.dropna(subset=["reviewText"]).reset_index(drop=True)
        if self.sample_size and len(df) > self.sample_size:
            df = df.sample(self.sample_size, random_state=self.random_state)
        df["reviewText"] = df["reviewText"].astype(str)
        return df


__all__ = ["ReviewDatasetLoader"]
```

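`_read_jsonl` streams one JSON object per non-blank line, which is the JSONL contract the README asks the dataset to follow. A minimal sketch of that contract, using an in-memory string in place of the review dump:

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line, mirroring ReviewDatasetLoader._read_jsonl."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


raw = '{"reviewText": "Works great", "overall": 5}\n\n{"reviewText": "Died in a week", "overall": 1}\n'
records = list(read_jsonl(io.StringIO(raw)))
assert len(records) == 2                       # the blank line is skipped
assert records[0]["reviewText"] == "Works great"
```

Because the generator is consumed with `list(...)` in `load()`, the whole file ends up in memory; the optional `sample_size` only trims the DataFrame afterwards.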
src/components/embedding_generator.py
CHANGED (new file, +40 lines)

```python
"""Sentence embedding generation utilities."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List

import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm


@dataclass
class EmbeddingGenerator:
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    batch_size: int = 64
    normalize: bool = True

    def __post_init__(self) -> None:
        self.model = SentenceTransformer(self.model_name)

    def encode(self, texts: Iterable[str]) -> np.ndarray:
        embeddings: List[np.ndarray] = []
        batch: List[str] = []
        for text in texts:
            batch.append(text)
            if len(batch) == self.batch_size:
                embeddings.append(self.model.encode(batch, normalize_embeddings=self.normalize))
                batch = []
        if batch:
            embeddings.append(self.model.encode(batch, normalize_embeddings=self.normalize))
        return np.vstack(embeddings)

    def save(self, embeddings: np.ndarray, path: Path) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        np.save(path, embeddings)


__all__ = ["EmbeddingGenerator"]
```

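`EmbeddingGenerator.encode` accumulates texts into fixed-size batches and flushes a final partial batch at the end. The same loop can be checked with a stub encoder (`fake_encode` stands in for `SentenceTransformer.encode` and just records batch sizes):

```python
def fake_encode(batch):
    """Stub for the model call: return the batch size instead of embeddings."""
    return len(batch)


def batched_encode(texts, batch_size=4):
    sizes, batch = [], []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:   # flush a full batch
            sizes.append(fake_encode(batch))
            batch = []
    if batch:                          # flush the trailing partial batch
        sizes.append(fake_encode(batch))
    return sizes


assert batched_encode([f"review {i}" for i in range(10)]) == [4, 4, 2]
```

Accepting an `Iterable[str]` means the loop also works on generators, at the cost of not knowing the total count up front (which may be why the imported `tqdm` is left unused here).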
src/components/query_engine.py
CHANGED (new file, +37 lines)

```python
"""Simple semantic search over review embeddings."""

from __future__ import annotations

from dataclasses import dataclass
from typing import List, Sequence, Tuple

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer


@dataclass
class QueryEngine:
    embeddings: np.ndarray
    documents: Sequence[str]
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    top_k: int = 5

    def __post_init__(self) -> None:
        if len(self.documents) != len(self.embeddings):
            raise ValueError("Embeddings and documents must be aligned")
        self.model = SentenceTransformer(self.model_name)
        self.index = NearestNeighbors(metric="cosine")
        self.index.fit(self.embeddings)

    def search(self, query: str) -> List[Tuple[str, float]]:
        query_emb = self.model.encode([query])
        distances, indices = self.index.kneighbors(query_emb, n_neighbors=self.top_k)
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            similarity = 1 - dist
            results.append((self.documents[idx], float(similarity)))
        return results


__all__ = ["QueryEngine"]
```

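`search` converts scikit-learn's cosine *distance* back into a similarity with `1 - dist`. The same ranking can be reproduced with plain NumPy: on unit-normalised vectors, cosine similarity is just a dot product (the toy vectors below are assumptions for illustration, not real review embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalise each row

query = docs[3] + 0.01 * rng.normal(size=8)          # near-duplicate of doc 3
query /= np.linalg.norm(query)

sims = docs @ query                                   # cosine similarity per doc
top_k = np.argsort(-sims)[:2]                         # best matches first
assert top_k[0] == 3                                  # the near-duplicate wins
```

This also explains why `EmbeddingGenerator` normalises embeddings by default: with unit vectors, cosine distance and Euclidean nearest-neighbour rankings agree.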
src/components/summarization_engine.py
CHANGED (new file, +45 lines)

```python
"""Cluster-level abstractive summarisation helpers."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable

from transformers import pipeline


@dataclass
class SummarizationEngine:
    model_name: str = "google/pegasus-xsum"
    max_length: int = 128
    min_length: int = 32
    max_input_chars: int = 6000
    max_reviews: int = 100

    def __post_init__(self) -> None:
        self._pipeline = pipeline(
            "summarization",
            model=self.model_name,
            tokenizer=self.model_name,
        )

    def summarize(self, texts: Iterable[str]) -> str:
        cleaned = [text.strip() for text in texts if text and text.strip()]
        if not cleaned:
            return ""
        if self.max_reviews:
            cleaned = cleaned[: self.max_reviews]
        joined = " ".join(cleaned)
        if len(joined) > self.max_input_chars:
            joined = joined[: self.max_input_chars]
        output = self._pipeline(
            joined,
            max_length=self.max_length,
            min_length=self.min_length,
            do_sample=False,
            truncation=True,
        )
        return output[0]["summary_text"].strip()


__all__ = ["SummarizationEngine"]
```

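Before calling the model, `summarize` drops blank reviews, caps the review count, joins the rest, and truncates to `max_input_chars`. That pre-processing can be verified without loading Pegasus (`prepare_input` is a hypothetical helper mirroring the method up to the pipeline call):

```python
def prepare_input(texts, max_reviews=100, max_input_chars=6000):
    """Mirror SummarizationEngine.summarize up to the model call."""
    cleaned = [t.strip() for t in texts if t and t.strip()]
    if not cleaned:
        return ""                      # nothing usable: skip the model entirely
    if max_reviews:
        cleaned = cleaned[:max_reviews]
    joined = " ".join(cleaned)
    if len(joined) > max_input_chars:  # hard character cap before tokenisation
        joined = joined[:max_input_chars]
    return joined


texts = ["  good battery  ", "", None, "x" * 50]
out = prepare_input(texts, max_reviews=2, max_input_chars=30)
assert out == "good battery " + "x" * 17   # capped at 2 reviews, then 30 chars
assert len(out) == 30
```

The character cap is a blunt safeguard; the `truncation=True` flag passed to the Hugging Face pipeline does the precise token-level truncation.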
src/components/visualization.py
CHANGED (new file, +33 lines)

```python
"""Utility plots for exploratory analysis."""

from __future__ import annotations

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


sns.set_style("whitegrid")


def plot_rating_distribution(df: pd.DataFrame):
    if "overall" not in df.columns:
        raise ValueError("Column 'overall' not present")
    plt.figure(figsize=(7, 4))
    sns.countplot(x="overall", data=df, palette="viridis")
    plt.title("Ratings distribution")
    return plt.gca()


def plot_cluster_sizes(labels):
    series = pd.Series(labels)
    counts = series.value_counts().sort_index()
    plt.figure(figsize=(10, 4))
    counts.plot(kind="bar", color="#0b7fab")
    plt.title("Cluster sizes")
    plt.xlabel("Cluster id")
    plt.ylabel("# Reviews")
    return plt.gca()


__all__ = ["plot_rating_distribution", "plot_cluster_sizes"]
```

src/config/cluster_config.json
CHANGED (new file, +6 lines)

```json
{
  "n_clusters": 20,
  "use_pca": true,
  "pca_components": 50,
  "random_state": 42
}
```

src/config/config.yaml
CHANGED (new file, +14 lines)

```yaml
data:
  raw_path: artifacts/raw_data/electronics_sample_50k.json
  cleaned_path: artifacts/cleaned_data/clean_reviews.parquet
  embeddings_path: artifacts/embeddings/review_embeddings.npy
  cluster_assignments_path: artifacts/clustering/cluster_labels.csv
  summaries_path: artifacts/summaries/cluster_summaries.json
models:
  embedding_model: sentence-transformers/all-MiniLM-L6-v2
  summarizer_model: google/pegasus-xsum
clustering:
  n_clusters: 20
  pca_components: 50
app:
  results_cache: artifacts/summaries/cluster_summaries.json
```

src/config/model_config.json
CHANGED (new file, +6 lines)

```json
{
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
  "summarizer_model": "google/pegasus-xsum",
  "max_summary_length": 128,
  "min_summary_length": 32
}
```

src/pipelines/build_embeddings_pipeline.py
CHANGED (new file, +42 lines)

```python
"""CLI to build review embeddings from the raw dataset."""

from __future__ import annotations

import argparse
import json
from pathlib import Path

import numpy as np
import yaml

from src.components.data_loader import ReviewDatasetLoader
from src.components.data_cleaning import ReviewCleaner
from src.components.embedding_generator import EmbeddingGenerator


def run(config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    model_cfg = config["models"]

    loader = ReviewDatasetLoader(Path(data_cfg["raw_path"]))
    cleaner = ReviewCleaner()
    df = cleaner(loader.load())

    generator = EmbeddingGenerator(model_name=model_cfg["embedding_model"])
    embeddings = generator.encode(df["clean_text"].tolist())
    generator.save(embeddings, Path(data_cfg["embeddings_path"]))

    df[["reviewText", "clean_text"]].to_parquet(data_cfg["cleaned_path"], index=False)
    print(f"Saved {len(df)} cleaned reviews and embeddings -> {data_cfg['embeddings_path']}")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.config)


if __name__ == "__main__":
    cli()
```

src/pipelines/clustering_pipeline.py
CHANGED (new file, +42 lines)

```python
"""Cluster review embeddings and persist cluster labels."""

from __future__ import annotations

import argparse
from pathlib import Path

import pandas as pd
import numpy as np
import yaml

from src.components.clustering_engine import ClusteringEngine


def run(config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    cluster_cfg = config["clustering"]

    embeddings = np.load(data_cfg["embeddings_path"])
    engine = ClusteringEngine(
        n_clusters=cluster_cfg["n_clusters"],
        use_pca=cluster_cfg.get("use_pca", True),
        pca_components=cluster_cfg.get("pca_components"),
    )
    labels = engine.fit_predict(embeddings)

    df = pd.read_parquet(data_cfg["cleaned_path"])
    df["cluster_id"] = labels
    df.to_csv(data_cfg["cluster_assignments_path"], index=False)
    print(f"Clustered {len(df)} reviews into {labels.max()+1} clusters")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.config)


if __name__ == "__main__":
    cli()
```

src/pipelines/full_run_pipeline.py
CHANGED (new file, +19 lines)

```python
"""Execute the full offline pipeline: load -> clean -> embed -> cluster -> summarize."""

from __future__ import annotations

from pathlib import Path

from src.pipelines.build_embeddings_pipeline import run as build_embeddings
from src.pipelines.clustering_pipeline import run as cluster_reviews
from src.pipelines.summarization_pipeline import run as summarize_clusters


def run(config_path: str = "src/config/config.yaml") -> None:
    build_embeddings(config_path)
    cluster_reviews(config_path)
    summarize_clusters(config_path)


if __name__ == "__main__":
    run()
```

src/pipelines/query_pipeline.py
CHANGED
@@ -0,0 +1,38 @@
"""Expose a CLI helper to run semantic search queries."""

from __future__ import annotations

import argparse
import json
from pathlib import Path

import numpy as np
import pandas as pd
import yaml

from src.components.query_engine import QueryEngine


def run(query: str, config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    embeddings = np.load(data_cfg["embeddings_path"])
    df = pd.read_parquet(data_cfg["cleaned_path"])

    engine = QueryEngine(embeddings=embeddings, documents=df["reviewText"].tolist())
    results = engine.search(query)

    for rank, (text, score) in enumerate(results, start=1):
        print(f"[{rank}] score={score:.3f}\n{text}\n")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("query", help="Natural language query to look up")
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.query, args.config)


if __name__ == "__main__":
    cli()
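query_pipeline.py treats `QueryEngine` (defined in src/components/query_engine.py, outside this hunk) as a black box that takes precomputed embeddings plus the raw documents and returns `(text, score)` pairs from `search()`. A minimal sketch of that interface, assuming cosine similarity over unit-normalised vectors; the `encoder` argument is a hypothetical stand-in for whatever sentence encoder the real component uses internally:

```python
import numpy as np


class QueryEngine:
    """Sketch of the interface query_pipeline.py relies on (assumed, not the real component)."""

    def __init__(self, embeddings, documents, encoder=None):
        # L2-normalise once so a plain dot product is a cosine similarity.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / np.clip(norms, 1e-12, None)
        self.documents = documents
        self.encoder = encoder  # callable: str -> 1-D embedding vector

    def search(self, query, top_k=5):
        q = self.encoder(query)
        q = q / max(np.linalg.norm(q), 1e-12)
        scores = self.embeddings @ q
        # Highest-scoring documents first.
        idx = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[i], float(scores[i])) for i in idx]
```

With a toy encoder that maps every query onto the second axis, `search(..., top_k=1)` returns the second document with a score of 1.0, which matches the `(text, score)` unpacking in the pipeline's print loop.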
src/pipelines/summarization_pipeline.py
CHANGED
@@ -0,0 +1,47 @@
"""Generate abstractive summaries for each review cluster."""

from __future__ import annotations

import argparse
import json
from pathlib import Path

import pandas as pd
import yaml

from src.components.summarization_engine import SummarizationEngine


def run(config_path: str = "src/config/config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())
    data_cfg = config["data"]
    model_cfg = config["models"]

    df = pd.read_csv(data_cfg["cluster_assignments_path"])
    engine = SummarizationEngine(
        model_name=model_cfg["summarizer_model"],
        max_length=model_cfg.get("max_summary_length", 128),
        min_length=model_cfg.get("min_summary_length", 32),
    )

    summaries = []
    for cluster_id, group in df.groupby("cluster_id"):
        summary = engine.summarize(group["reviewText"].tolist()[:200])
        summaries.append({
            "cluster_id": int(cluster_id),
            "summary": summary,
            "size": int(len(group)),
        })
    Path(data_cfg["summaries_path"]).write_text(json.dumps(summaries, indent=2))
    print(f"Wrote {len(summaries)} cluster summaries -> {data_cfg['summaries_path']}")


def cli() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--config", default="src/config/config.yaml")
    args = parser.parse_args()
    run(args.config)


if __name__ == "__main__":
    cli()
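Both pipelines above resolve every path and model setting through src/config/config.yaml (added elsewhere in this commit). A sketch of only the keys these two files actually read; the concrete paths and the model name are assumptions chosen to match the artifacts/ layout, not the committed values:

```yaml
data:
  cleaned_path: artifacts/cleaned_data/reviews.parquet          # read by query_pipeline
  embeddings_path: artifacts/embeddings/review_embeddings.npy   # read by query_pipeline
  cluster_assignments_path: artifacts/clustering/assignments.csv  # read by summarization_pipeline
  summaries_path: artifacts/summaries/cluster_summaries.json      # written by summarization_pipeline
models:
  summarizer_model: facebook/bart-large-cnn   # hypothetical default
  max_summary_length: 128                     # optional; pipeline falls back to 128
  min_summary_length: 32                      # optional; pipeline falls back to 32
```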
src/utils/__pycache__/exception.cpython-39.pyc
DELETED
Binary file (510 Bytes)

src/utils/__pycache__/file_utils.cpython-39.pyc
DELETED
Binary file (1.01 kB)

src/utils/__pycache__/logger.cpython-39.pyc
DELETED
Binary file (581 Bytes)

src/utils/__pycache__/plot_utils.cpython-39.pyc
DELETED
Binary file (680 Bytes)

src/utils/__pycache__/text_utils.cpython-39.pyc
DELETED
Binary file (590 Bytes)
static/styles.css
CHANGED
@@ -0,0 +1,59 @@
body {
  font-family: "Inter", "Segoe UI", sans-serif;
  margin: 0;
  padding: 0;
  background: #f4f6fb;
  color: #1f2933;
}

header {
  background: linear-gradient(120deg, #0b7fab, #3aa17e);
  color: white;
  padding: 2rem;
  text-align: center;
}

main {
  max-width: 960px;
  margin: 2rem auto;
  background: white;
  border-radius: 12px;
  box-shadow: 0 20px 40px rgba(15, 23, 42, 0.1);
  padding: 2rem 3rem;
}

form textarea {
  width: 100%;
  min-height: 140px;
  border-radius: 8px;
  border: 1px solid #d3dae6;
  padding: 1rem;
  font-size: 1rem;
  resize: vertical;
}

button {
  background: #0b7fab;
  border: none;
  color: white;
  font-size: 1rem;
  padding: 0.9rem 1.8rem;
  border-radius: 999px;
  cursor: pointer;
  margin-top: 1rem;
}

button:hover {
  background: #095c7d;
}

.summary-card {
  border: 1px solid #e5e9f2;
  border-radius: 10px;
  padding: 1rem 1.5rem;
  margin-bottom: 1rem;
}

.summary-card h3 {
  margin-top: 0;
}
templates/index.html
CHANGED
@@ -0,0 +1,35 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Opinion Summarizer</title>
  <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" />
</head>
<body>
  <header>
    <h1>Opinion Summarizer</h1>
    <p>Explore high-level themes from thousands of electronics reviews.</p>
  </header>
  <main>
    <section>
      <h2>Ask a question</h2>
      <form action="{{ url_for('results') }}" method="post">
        <label for="query">What would you like to know?</label>
        <textarea id="query" name="query" placeholder="e.g. battery life of noise cancelling headphones" required></textarea>
        <button type="submit">Search</button>
      </form>
    </section>
    {% if summaries %}
    <section>
      <h2>Latest cluster summaries</h2>
      {% for item in summaries %}
      <div class="summary-card">
        <h3>Cluster {{ item.cluster_id }} ({{ item.size }} reviews)</h3>
        <p>{{ item.summary }}</p>
      </div>
      {% endfor %}
    </section>
    {% endif %}
  </main>
</body>
</html>
templates/results.html
CHANGED
@@ -0,0 +1,29 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Search results • Opinion Summarizer</title>
  <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" />
</head>
<body>
  <header>
    <h1>Search results</h1>
    <p>Query: <strong>{{ query }}</strong></p>
    <a href="{{ url_for('home') }}" style="color:white">← Back</a>
  </header>
  <main>
    {% if results %}
    <section>
      {% for result in results %}
      <div class="summary-card">
        <h3>Score {{ '{:.2f}'.format(result.score * 100) }}%</h3>
        <p>{{ result.text }}</p>
      </div>
      {% endfor %}
    </section>
    {% else %}
    <p>No matching reviews found. Try a different question.</p>
    {% endif %}
  </main>
</body>
</html>
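The two templates assume a Flask app (app.py in this commit, not shown in this hunk) exposing a `home` endpoint that renders the search form and a `results` endpoint that receives the POSTed `query` field. A self-contained sketch of that wiring, using `render_template_string` as a stand-in for the real templates so it runs without a templates/ directory; the real routes render index.html and results.html and pass in objects with `.text` and `.score` attributes:

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)


@app.route("/")
def home():
    # Stand-in for render_template("index.html", summaries=...); the real
    # index.html posts a textarea named "query" to url_for('results').
    return render_template_string(
        "<form action=\"{{ url_for('results') }}\" method=\"post\">"
        "<textarea name=\"query\"></textarea>"
        "<button type=\"submit\">Search</button></form>"
    )


@app.route("/results", methods=["POST"])
def results():
    query = request.form["query"]
    # The real handler runs the query pipeline and hands results.html a list
    # of scored matches; here we only echo the query back.
    return render_template_string(
        "Query: <strong>{{ query }}</strong>", query=query
    )
```

`url_for('results')` resolves against the `results` view function name, which is why results.html's "Back" link relies on the companion view being named `home`.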