
Aspect-Based Sentiment Analysis - Technical Documentation

This module extracts product aspects from reviews, classifies the sentiment expressed toward each aspect, and returns aggregated highlights, pros, cons, and an advisory-tone summary. Both English and Arabic reviews are supported; output is always in English.

Architecture

src/aspect_based_sentiment/routes.py           -> FastAPI endpoints
src/aspect_based_sentiment/aspect_sentiment.py -> ABSA pipeline + aggregation
src/models.py                                  -> Lazy-loaded NLP models
src/utils.py                                   -> Supabase client singleton

Endpoint

GET /product/{product_id}/review-summary

Query parameters

  • threshold_divisor (default 4.0): controls how strict aspect-level mention thresholds are.

The confidence threshold is not a query parameter: it is fixed in code at 0.65, and ABSA predictions below it are discarded.

Thresholds are computed as:

  • pos_threshold = total_reviews / threshold_divisor
  • neg_threshold = total_reviews / threshold_divisor

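A lower threshold_divisor yields higher (stricter) thresholds; a higher value yields lower (looser) ones. For example, with 20 reviews the default divisor of 4.0 gives a threshold of 5 mentions, while a divisor of 2.0 raises it to 10.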
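The threshold arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function name mention_thresholds is hypothetical, and the rejection of non-positive divisors is an assumption, not something this document states about the real endpoint.

```python
def mention_thresholds(total_reviews: int, threshold_divisor: float = 4.0) -> tuple[float, float]:
    """Return (pos_threshold, neg_threshold) using the formula in this section."""
    # Assumption: a zero or negative divisor is treated as invalid input.
    if threshold_divisor <= 0:
        raise ValueError("threshold_divisor must be positive")
    threshold = total_reviews / threshold_divisor
    # Both thresholds use the same formula, so they are always equal.
    return threshold, threshold


if __name__ == "__main__":
    for divisor in (2.0, 4.0, 10.0):
        pos, neg = mention_thresholds(20, divisor)
        print(f"divisor={divisor}: pos_threshold={pos}, neg_threshold={neg}")
```

Whether an aspect must meet or strictly exceed the threshold to qualify is not specified here, so the sketch stops at computing the values.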

Pipeline

Reviews from Supabase
  -> [Arabic only] translate to English via Helsinki-NLP/opus-mt-ar-en
  -> extract noun chunks (candidate aspects) using spaCy en_core_web_md
  -> compare chunks against the product's title, tags, and categories to discard chunks that refer to the product itself
  -> classify each (review, aspect) pair with DeBERTa ABSA
  -> normalize aspect names (for example, "the camera" -> "camera")
  -> aggregate positive/negative counts across reviews
  -> generate aspect-level UI lines using randomized templates
  -> generate 3-4 advisory sentences (Noon-style summary)
  -> return highlights + pros + cons
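The Arabic-only branch at the top of this pipeline depends on a language check, which the stage details describe as a Unicode character ratio with a 30% cut-off. A minimal sketch of such a check follows. It is not the project's _is_arabic implementation: the function names are hypothetical, and two details are assumptions, namely that the ratio is taken over alphabetic characters only and that "Arabic" means the basic Arabic block U+0600 to U+06FF.

```python
def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the basic Arabic block."""
    # Assumption: digits, spaces, and punctuation are ignored in the ratio.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def looks_arabic(text: str, cutoff: float = 0.30) -> bool:
    """True when more than `cutoff` of the letters are Arabic."""
    return arabic_ratio(text) > cutoff


if __name__ == "__main__":
    for sample in ("great phone", "هذا الهاتف ممتاز", "good camera, جيد"):
        print(repr(sample), looks_arabic(sample))
```

With a ratio rule like this, a mostly English review containing one or two Arabic words stays on the English path, while a mixed review with a substantial Arabic share is translated first.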

Stage details

  1. _is_arabic(text) / _translate_to_english(text) (routes.py)
  • Detects Arabic by Unicode character ratio (>30% Arabic chars triggers translation).
  • Translates with Helsinki-NLP/opus-mt-ar-en before any downstream processing.
  2. extract_aspects(text, product_title, product_tags, product_categories)
  • Expects English text; Arabic is pre-translated upstream.
  • Uses spaCy en_core_web_md noun chunks as candidate aspects.
  • Drops non-meaningful chunks (determiners, pronouns) and lowercases output.
  • Self-reference filter: prevents the product itself from being picked up as an aspect (e.g. "phone" for a smartphone).
    • Gathers the product's title, tags, and database categories.
    • Splits English compound terms into their parts (e.g., smartphone -> smart, phone).
    • Lemmatizes with spaCy so that plural categories still match (e.g., a review saying "phone" matches the category "smartphones").
    • Filters out candidates via a subset match on lemmas against these product terms, before ABSA classification runs.
  3. classify_aspects(review_text, aspects)
  • Runs ABSA classification with Positive, Negative, or Neutral labels.
  • Returns sentiment label and confidence score for each aspect.
  4. aggregate_pros_cons(...)
  • Filters predictions below the fixed confidence threshold (0.65).
  • Merges similar aspect strings with normalization.
  • Counts positive and negative mentions per aspect.
  • Builds an aspect-level summary line using randomized text templates based purely on the aggregated mention counts.
  • Produces:
    • highlights: ranked by total_mentions
    • pros: aspects where positive dominance exceeds threshold
    • cons: aspects where negative dominance exceeds threshold
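The aggregation stage can be sketched as follows. This is a simplified illustration, not the code in aspect_sentiment.py. Several points are assumptions: the function names, the input shape (a list of (aspect, label, confidence) tuples), the determiner list used for normalization, and above all the pros/cons rule. The document does not define "dominance", so the sketch assumes an aspect is a pro when positive mentions outnumber negative ones and reach the threshold, and symmetrically for cons. It also returns aspect names for pros and cons, whereas the real endpoint returns sentences.

```python
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.65  # fixed value stated in this document
_DETERMINERS = ("the ", "a ", "an ", "this ", "that ", "my ")  # assumed list


def normalize_aspect(aspect: str) -> str:
    """Lowercase, collapse whitespace, drop one leading determiner."""
    cleaned = " ".join(aspect.lower().split())
    for det in _DETERMINERS:
        if cleaned.startswith(det):
            return cleaned[len(det):]
    return cleaned


def aggregate(predictions, total_reviews, threshold_divisor=4.0):
    """predictions: iterable of (aspect, label, confidence) tuples."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for aspect, label, confidence in predictions:
        if confidence < CONFIDENCE_THRESHOLD:
            continue  # drop low-confidence predictions
        key = label.lower()
        if key not in ("positive", "negative"):
            continue  # neutral labels do not count as mentions here
        counts[normalize_aspect(aspect)][key] += 1

    threshold = total_reviews / threshold_divisor
    highlights = sorted(
        (
            {
                "aspect": name,
                "positive_mentions": c["positive"],
                "negative_mentions": c["negative"],
                "total_mentions": c["positive"] + c["negative"],
            }
            for name, c in counts.items()
        ),
        key=lambda h: h["total_mentions"],
        reverse=True,
    )
    # Assumed dominance rule; see the lead-in above.
    pros = [h["aspect"] for h in highlights
            if h["positive_mentions"] > h["negative_mentions"]
            and h["positive_mentions"] >= threshold]
    cons = [h["aspect"] for h in highlights
            if h["negative_mentions"] > h["positive_mentions"]
            and h["negative_mentions"] >= threshold]
    return {"highlights": highlights, "pros": pros, "cons": cons}
```

Keeping highlights independent of the thresholds means a frequently mentioned but contested aspect still surfaces, even when it qualifies as neither a pro nor a con.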

Models Used

| Component | Model | Role |
|---|---|---|
| Arabic translation | Helsinki-NLP/opus-mt-ar-en | Translate Arabic reviews to English |
| Aspect extraction | en_core_web_md | Noun-chunk extraction |
| ABSA classifier | yangheng/deberta-v3-base-absa-v1.1 | Sentiment per (review, aspect) pair |

All models run on CPU with lazy loading.
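Lazy loading here means a model is loaded on first use and then reused. One common way to get that behaviour in Python is a cached accessor, sketched below. This is not the code in src/models.py: get_model, _load_model, and load_log are hypothetical names, and _load_model is a stand-in that returns a plain dict instead of loading a real spaCy or Transformers model.

```python
from functools import lru_cache

load_log = []  # records each real load, only to make the laziness observable


def _load_model(name: str) -> dict:
    """Hypothetical stand-in for an expensive model load on CPU."""
    load_log.append(name)
    return {"name": name, "device": "cpu"}


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Load the model on first request, then return the cached instance."""
    return _load_model(name)


if __name__ == "__main__":
    first = get_model("en_core_web_md")
    second = get_model("en_core_web_md")
    print(first is second, load_log)
```

Nothing is loaded at import time, so a process that never receives an Arabic review would never pay for the translation model under this pattern.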

Response Shape

{
  "product_id": "uuid-string",
  "total_reviews": 20,
  "highlights": [
    {
      "aspect": "camera",
      "summary": "Camera quality is excellent but it can overheat during long recording sessions.",
      "positive_mentions": 9,
      "negative_mentions": 4,
      "total_mentions": 13
    }
  ],
  "pros": [
    "Camera quality is excellent but it can overheat during long recording sessions."
  ],
  "cons": [
    "The device overheats during long recording sessions."
  ]
}
  • highlights[].summary contains the Noon-style advisory sentences (combined from all pros/cons).
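For consumers of this endpoint, the response shape can be written down as typed dictionaries. The sketch below is a client-side convenience under stated assumptions: the class names and parse_review_summary are hypothetical, and the check only verifies that the top-level keys shown in the example are present.

```python
import json
from typing import List, TypedDict


class Highlight(TypedDict):
    aspect: str
    summary: str
    positive_mentions: int
    negative_mentions: int
    total_mentions: int


class ReviewSummary(TypedDict):
    product_id: str
    total_reviews: int
    highlights: List[Highlight]
    pros: List[str]
    cons: List[str]


_REQUIRED_KEYS = {"product_id", "total_reviews", "highlights", "pros", "cons"}


def parse_review_summary(raw: str) -> ReviewSummary:
    """Parse a JSON response body and check the top-level keys."""
    data = json.loads(raw)
    missing = _REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```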

Low-Review Interpretation Guidance

For products with few reviews, threshold-driven outputs can look stronger than the evidence supports.

Recommended interpretation policy:

  • Keep threshold logic enabled for consistency.
  • Treat outputs as low confidence when total_reviews < 10.
  • Prefer displaying highlights plus a low-confidence note rather than making strong pros/cons claims.
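One way a caller might apply this policy is at the presentation layer, leaving the endpoint's threshold logic untouched. The sketch below is an assumption-laden example: apply_low_review_policy and the low_confidence flag are hypothetical and are not fields of the actual response, and emptying pros and cons is one reading of "prefer displaying highlights".

```python
LOW_REVIEW_CUTOFF = 10  # from the guidance above: total_reviews < 10


def apply_low_review_policy(summary: dict) -> dict:
    """Return a display copy of the summary with a low-confidence flag."""
    low = summary["total_reviews"] < LOW_REVIEW_CUTOFF
    presented = dict(summary)  # shallow copy; the input is not modified
    presented["low_confidence"] = low  # hypothetical display-only field
    if low:
        # Keep highlights, but avoid making strong pros/cons claims.
        presented["pros"] = []
        presented["cons"] = []
    return presented


if __name__ == "__main__":
    sample = {"total_reviews": 6, "highlights": [{"aspect": "camera"}],
              "pros": ["Great camera."], "cons": []}
    print(apply_low_review_policy(sample))
```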

Data Source

Table: reviews

Fields used by ABSA endpoint:

  • id
  • product_id
  • rating
  • title (fallback when content is empty)
  • content (primary text)

The sentiment field is not currently used by this endpoint when serving responses.
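The content/title fallback listed above can be illustrated with a small helper. This is a sketch, not the endpoint's code: review_text is a hypothetical name, rows are assumed to be plain dicts, and treating whitespace-only values as empty is an assumption.

```python
from typing import Optional


def review_text(row: dict) -> Optional[str]:
    """Pick the text to analyze: content first, title as a fallback."""
    for field in ("content", "title"):
        # Assumption: None and whitespace-only strings count as empty.
        value = (row.get(field) or "").strip()
        if value:
            return value
    return None  # nothing usable in this row


if __name__ == "__main__":
    print(review_text({"content": "", "title": "Solid battery life"}))
```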