# Aspect-Based Review Intelligence System – Complete Build Plan
> **Goal:** Fine-tune RoBERTa for aspect-based sentiment analysis + RAG Q&A layer + Streamlit dashboard.
> **Resume line:** NLP + Transformer Fine-tuning + RAG + FastAPI + Deployed
---
## Project Architecture
```mermaid
graph LR
A[Raw Reviews] --> B[Data Processing]
B --> C[Fine-tune RoBERTa]
C --> D[ABSA Model]
B --> E[Embed Reviews]
E --> F[FAISS Vector Store]
D --> G[FastAPI Backend]
F --> G
G --> H[Streamlit Dashboard]
H --> I1[Aspect Sentiment Heatmap]
H --> I2[Trend Charts]
H --> I3[Natural Language Q&A]
```
---
## What the System Does
```
INPUT: "The pizza was amazing but the waiter was incredibly rude and slow"
OUTPUT (ABSA Model):
├── food → Positive (confidence: 0.94)
├── service → Negative (confidence: 0.91)
├── ambience → No mention
└── price → No mention
OUTPUT (RAG Q&A):
User: "Why do customers complain about service?"
System: "Based on 847 reviews, the top service complaints are:
1. Slow wait times (mentioned in 34% of negative reviews)
2. Rude staff behavior (28%)
3. Order mistakes (19%)"
```
---
## Dataset: SemEval 2014 Task 4
| Detail | Value |
|---|---|
| **Name** | SemEval-2014 Task 4: Aspect-Based Sentiment Analysis |
| **Domain** | Restaurant reviews (a Laptop domain also exists – use Restaurant) |
| **Size** | ~3,000 training sentences, ~800 test sentences |
| **Labels** | Aspect categories: `food`, `service`, `ambience`, `price`, `anecdotes/miscellaneous` |
| **Sentiments** | `positive`, `negative`, `neutral`, `conflict` |
| **Format** | XML |
| **Why this dataset** | Standard academic benchmark. Any interviewer who knows NLP will recognize it. Your results are directly comparable to published papers. |
> [!IMPORTANT]
> **Download link:** [SemEval 2014 Task 4 Dataset](https://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools)
> Download the Restaurant train + test XML files.
---
## Complete File Structure
```
review-intelligence/
│
├── data/
│   ├── raw/                      # SemEval XML files go here
│   │   ├── Restaurants_Train_v2.xml
│   │   └── Restaurants_Test_Gold.xml
│   └── processed/                # Generated CSVs
│       ├── train.csv
│       └── test.csv
│
├── src/
│   ├── data_processing.py        # Step 1: Parse XML → CSV
│   ├── train_absa.py             # Step 2: Fine-tune RoBERTa
│   ├── evaluate.py               # Step 3: Evaluate model
│   ├── inference.py              # Step 4: Single-review prediction
│   ├── build_vectorstore.py      # Step 5: Embed reviews → FAISS
│   └── rag_engine.py             # Step 6: RAG retrieval + LLM answer
│
├── api/
│   └── main.py                   # Step 7: FastAPI backend
│
├── app/
│   └── streamlit_app.py          # Step 8: Dashboard
│
├── models/                       # Saved fine-tuned model
├── vectorstore/                  # FAISS index files
│
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── .env                          # API keys (Groq/OpenAI)
└── README.md
```
**Total files to write: 12** (8 Python + 4 config/docs)
---
## Step-by-Step Implementation
---
### Step 1: `src/data_processing.py` – Parse SemEval XML to CSV
**What it does:** Reads the XML format, extracts each sentence + aspect category + sentiment, outputs a clean CSV.
**Input XML format:**
```xml
<sentence id="1">
<text>The pizza was amazing but the waiter was rude.</text>
<aspectCategories>
<aspectCategory category="food" polarity="positive"/>
<aspectCategory category="service" polarity="negative"/>
</aspectCategories>
</sentence>
```
**Output CSV format:**
| text | aspect | sentiment |
|---|---|---|
| The pizza was amazing but the waiter was rude. | food | positive |
| The pizza was amazing but the waiter was rude. | service | negative |
**Key logic:**
```python
import xml.etree.ElementTree as ET
import pandas as pd

def parse_semeval_xml(xml_path):
    """Parse a SemEval-2014 XML file into one row per (sentence, aspect) pair."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    rows = []
    for sentence in root.findall('.//sentence'):
        text = sentence.find('text').text
        # A sentence may carry several aspect categories, each with its own polarity
        for aspect_cat in sentence.findall('.//aspectCategory'):
            rows.append({
                'text': text,
                'aspect': aspect_cat.get('category'),
                'sentiment': aspect_cat.get('polarity')
            })
    return pd.DataFrame(rows)
```
**Label mapping:**
```python
SENTIMENT_MAP = {'positive': 0, 'negative': 1, 'neutral': 2, 'conflict': 3}
ASPECT_CATEGORIES = ['food', 'service', 'ambience', 'price', 'anecdotes/miscellaneous']  # note: the dataset spells it "ambience"
```
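To wire it together, a minimal driver (a sketch, assuming the paths from the file structure above) parses both files and writes the processed CSVs:
```python
# Sketch: parse both SemEval files and write the CSVs the rest of the pipeline reads.
if __name__ == '__main__':
    train_df = parse_semeval_xml('data/raw/Restaurants_Train_v2.xml')
    test_df = parse_semeval_xml('data/raw/Restaurants_Test_Gold.xml')
    train_df.to_csv('data/processed/train.csv', index=False)
    test_df.to_csv('data/processed/test.csv', index=False)
    print(f'Wrote {len(train_df)} train rows, {len(test_df)} test rows')
```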
---
### Step 2: `src/train_absa.py` – Fine-tune RoBERTa
**What it does:** Fine-tunes `roberta-base` for aspect-based sentiment classification.
**Model approach:** Auxiliary Sentence Pair Classification
- Input to model: the review text as sentence A and the aspect name as sentence B, e.g. `("The pizza was amazing but waiter was rude", "food")` (RoBERTa's tokenizer joins pairs with `<s>`/`</s>` markers, its equivalent of BERT's `[CLS]`/`[SEP]`)
- Output: `positive` (for the "food" aspect)
- This converts ABSA into a standard sentence-pair classification task that RoBERTa handles natively.
**Key implementation details:**
```python
# Tokenization – sentence-pair format
# Sentence A = review text
# Sentence B = aspect category name
inputs = tokenizer(
review_text, # "The pizza was amazing..."
aspect_category, # "food"
truncation=True,
padding='max_length',
max_length=128,
return_tensors='pt'
)
```
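To feed pairs like this through the HuggingFace `Trainer`, one option (a sketch; it assumes `tokenizer`, the `train_df` CSV from Step 1, and `SENTIMENT_MAP`) is to wrap the DataFrame in a `datasets.Dataset` and tokenize with `map`:
```python
from datasets import Dataset

def tokenize_batch(batch):
    # Sentence A = review text, sentence B = aspect category name
    enc = tokenizer(batch['text'], batch['aspect'],
                    truncation=True, padding='max_length', max_length=128)
    enc['labels'] = [SENTIMENT_MAP[s] for s in batch['sentiment']]
    return enc

train_dataset = Dataset.from_pandas(train_df).map(tokenize_batch, batched=True)
```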
**Training config:**
| Parameter | Value | Why |
|---|---|---|
| Base model | `roberta-base` | Best balance of size vs accuracy for this task |
| Learning rate | `2e-5` | Standard for transformer fine-tuning |
| Batch size | `16` | Fits in ~6GB GPU / free Colab |
| Epochs | `5` | SemEval is small; more epochs = overfitting |
| Max length | `128` | Restaurant reviews are short |
| Optimizer | AdamW | Standard for transformers |
| Scheduler | Linear warmup (10% steps) | Prevents early instability |
| Loss | CrossEntropyLoss | 4-class classification |
**Libraries needed:**
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
```
**Training loop (using HuggingFace Trainer):**
```python
model = RobertaForSequenceClassification.from_pretrained(
'roberta-base', num_labels=4 # pos, neg, neutral, conflict
)
training_args = TrainingArguments(
output_dir='./models/absa-roberta',
num_train_epochs=5,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
learning_rate=2e-5,
warmup_ratio=0.1,
weight_decay=0.01,
    evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers releases
save_strategy='epoch',
load_best_model_at_end=True,
metric_for_best_model='f1_macro',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
compute_metrics=compute_metrics,
)
trainer.train()
```
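`metric_for_best_model='f1_macro'` only works if `compute_metrics` returns a key with that name. A minimal version:
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # Trainer passes (logits, labels) for the eval set
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1_macro': f1_score(labels, preds, average='macro'),
    }
```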
**Expected results on SemEval 2014 Restaurant:**
- Accuracy: ~83-87%
- Macro F1: ~75-80%
- These are competitive with published baselines (~85% accuracy)
> [!TIP]
> **Where to train:** Use Google Colab (free T4 GPU). Training takes ~15-20 minutes for 5 epochs on SemEval-sized data.
---
### Step 3: `src/evaluate.py` – Evaluation & Metrics
**What it does:** Generates classification report, confusion matrix, per-aspect performance.
**Key metrics to compute:**
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Overall metrics
print(classification_report(y_true, y_pred,
      target_names=['positive', 'negative', 'neutral', 'conflict']))

# Per-aspect accuracy
for aspect in ASPECT_CATEGORIES:
    mask = (test_df['aspect'] == aspect).values
    aspect_acc = accuracy_score(y_true[mask], y_pred[mask])
    print(f"{aspect}: {aspect_acc:.2%}")
```
**What to save for resume:**
- Overall accuracy and macro F1
- Per-aspect F1 (shows where model is strong/weak)
- Comparison vs. a baseline (e.g., TF-IDF + Logistic Regression; see the sketch below)
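A quick way to get that baseline number (a sketch over the Step 1 CSVs; concatenating the aspect onto the text lets one classifier handle all five aspects):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Baseline: TF-IDF over "text + aspect" strings, then logistic regression
X_train = train_df['text'] + ' ' + train_df['aspect']
X_test = test_df['text'] + ' ' + test_df['aspect']

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                         LogisticRegression(max_iter=1000))
baseline.fit(X_train, train_df['sentiment'])
preds = baseline.predict(X_test)
print('Baseline macro F1:', f1_score(test_df['sentiment'], preds, average='macro'))
```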
---
### Step 4: `src/inference.py` – Single Review Prediction
**What it does:** Takes one review, runs it through the model for ALL aspects, returns structured output.
```python
import torch

# Inverse of SENTIMENT_MAP: {0: 'positive', 1: 'negative', 2: 'neutral', 3: 'conflict'}
LABELS = {v: k for k, v in SENTIMENT_MAP.items()}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def predict_aspects(review_text: str, model, tokenizer):
    """Run the model once per aspect category; keep only confident predictions."""
    results = {}
    model.to(device)
    model.eval()
    for aspect in ASPECT_CATEGORIES:
        inputs = tokenizer(review_text, aspect,
                           truncation=True, padding='max_length',
                           max_length=128, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**inputs.to(device))
        probs = torch.softmax(outputs.logits, dim=1)
        pred_label = torch.argmax(probs).item()
        confidence = probs[0][pred_label].item()
        # Below the threshold, treat the aspect as not mentioned
        if confidence > 0.6:
            results[aspect] = {
                'sentiment': LABELS[pred_label],
                'confidence': round(confidence, 3)
            }
    return results
```
**Example output:**
```json
{
"food": {"sentiment": "positive", "confidence": 0.94},
"service": {"sentiment": "negative", "confidence": 0.91}
}
```
---
### Step 5: `src/build_vectorstore.py` – Embed Reviews into FAISS
**What it does:** Takes all reviews, generates sentence embeddings, stores in FAISS for RAG retrieval.
```python
from sentence_transformers import SentenceTransformer
import faiss
import os
import pickle

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all reviews (dedup first: each review appears once per aspect row)
review_texts = df['text'].unique().tolist()
embeddings = embedder.encode(review_texts, show_progress_bar=True)

# Build FAISS index; inner product over L2-normalized vectors = cosine similarity
dimension = embeddings.shape[1]  # 384 for MiniLM
index = faiss.IndexFlatIP(dimension)
embeddings = embeddings.astype('float32')  # FAISS requires float32
faiss.normalize_L2(embeddings)
index.add(embeddings)

# Save the index and the texts it points back to
os.makedirs('vectorstore', exist_ok=True)
faiss.write_index(index, 'vectorstore/reviews.index')
with open('vectorstore/review_texts.pkl', 'wb') as f:
    pickle.dump(review_texts, f)
```
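Step 6 uses `embedder`, `index`, and `review_texts` without redefining them; `rag_engine.py` can restore them at import time (a sketch using the paths above):
```python
from sentence_transformers import SentenceTransformer
import faiss
import pickle

# Reload the artifacts written by build_vectorstore.py
embedder = SentenceTransformer('all-MiniLM-L6-v2')
index = faiss.read_index('vectorstore/reviews.index')
with open('vectorstore/review_texts.pkl', 'rb') as f:
    review_texts = pickle.load(f)
```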
---
### Step 6: `src/rag_engine.py` – RAG Q&A Engine
**What it does:** User asks a question β†’ retrieves relevant reviews β†’ LLM synthesizes answer.
```python
def answer_question(question: str, top_k: int = 10):
# 1. Embed the question
q_embedding = embedder.encode([question])
faiss.normalize_L2(q_embedding)
# 2. Search FAISS
scores, indices = index.search(q_embedding.astype('float32'), top_k)
retrieved_reviews = [review_texts[i] for i in indices[0]]
# 3. Run ABSA on each retrieved review
aspect_results = []
for review in retrieved_reviews:
aspects = predict_aspects(review, model, tokenizer)
aspect_results.append({'text': review, 'aspects': aspects})
# 4. Send to LLM for synthesis
context = "\n".join([
f"Review: {r['text']}\nAspects: {r['aspects']}"
for r in aspect_results
])
prompt = f"""Based on these customer reviews and their aspect sentiments:
{context}
Question: {question}
Provide a concise, data-backed answer with specific counts and percentages."""
# Call Groq/OpenAI
response = llm.chat.completions.create(
model="llama-3.1-8b-instant", # or gpt-3.5-turbo
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
```
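The `llm` client in the snippet is created once at module level. With Groq (the key comes from `.env` via `python-dotenv`), for example:
```python
import os
from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads GROQ_API_KEY from .env
llm = Groq(api_key=os.environ['GROQ_API_KEY'])
```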
---
### Step 7: `api/main.py` – FastAPI Backend
**Endpoints:**
| Endpoint | Method | What It Does |
|---|---|---|
| `/predict` | POST | Takes a review β†’ returns aspect sentiments |
| `/ask` | POST | Takes a question β†’ returns RAG answer |
| `/stats` | GET | Returns aggregate sentiment stats per aspect |
| `/health` | GET | Health check |
```python
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="Review Intelligence API")
class ReviewInput(BaseModel):
text: str
class QuestionInput(BaseModel):
question: str
@app.post("/predict")
def predict(review: ReviewInput):
results = predict_aspects(review.text, model, tokenizer)
return {"review": review.text, "aspects": results}
@app.post("/ask")
def ask(q: QuestionInput):
answer = answer_question(q.question)
return {"question": q.question, "answer": answer}
@app.get("/stats")
def get_stats():
# Return pre-computed aggregate stats
return aggregate_sentiment_stats
```
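Start the server with `uvicorn api.main:app --reload` and smoke-test it from Python. This sketch uses `requests`, which is not in requirements.txt (install it separately or swap in `httpx`):
```python
import requests

BASE = 'http://localhost:8000'

# Aspect sentiments for one review
r = requests.post(f'{BASE}/predict',
                  json={'text': 'The pizza was amazing but the waiter was rude.'})
print(r.json())

# RAG question answering
r = requests.post(f'{BASE}/ask',
                  json={'question': 'Why do customers complain about service?'})
print(r.json()['answer'])
```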
---
### Step 8: `app/streamlit_app.py` – Dashboard
**3 main panels:**
**Panel 1 β€” Live Analysis:**
- Text input: paste any review
- Click "Analyze" β†’ shows aspect sentiment cards with color coding
- Green = positive, Red = negative, Gray = neutral
**Panel 2 β€” Aggregate Dashboard:**
- Aspect sentiment heatmap (aspects × sentiment, color intensity = count)
- Trend chart showing sentiment over time per aspect (if reviews have timestamps)
- Bar chart: "Top 5 complaints" and "Top 5 praises"
**Panel 3 β€” Q&A:**
- Text input: "What do customers say about food quality?"
- Returns LLM-synthesized answer with source reviews shown below
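A minimal sketch of Panel 1 (the backend URL and the Streamlit color markup are illustrative choices, not fixed by the plan):
```python
import requests
import streamlit as st

st.title('Review Intelligence')

# Panel 1 – Live Analysis: paste a review, render color-coded aspect cards
review = st.text_area('Paste a review')
if st.button('Analyze') and review:
    resp = requests.post('http://localhost:8000/predict', json={'text': review})
    colors = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}
    for aspect, result in resp.json()['aspects'].items():
        color = colors.get(result['sentiment'], 'gray')
        st.markdown(f":{color}[**{aspect}**: {result['sentiment']} "
                    f"({result['confidence']:.0%})]")
```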
---
## requirements.txt
```
torch>=2.0
transformers>=4.35
datasets>=2.14
sentence-transformers>=2.2
faiss-cpu>=1.7
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
fastapi>=0.104
uvicorn>=0.24
streamlit>=1.28
plotly>=5.17
groq>=0.4
python-dotenv>=1.0
```
---
## Build Order (Do This Sequence)
| Step | File | Time Estimate | Dependency |
|---|---|---|---|
| 1 | Download SemEval data | 10 min | None |
| 2 | `src/data_processing.py` | 30 min | Step 1 |
| 3 | `src/train_absa.py` | 2-3 hours | Step 2 |
| 4 | `src/evaluate.py` | 30 min | Step 3 |
| 5 | `src/inference.py` | 30 min | Step 3 |
| 6 | `src/build_vectorstore.py` | 20 min | Step 2 |
| 7 | `src/rag_engine.py` | 1 hour | Steps 5+6 |
| 8 | `api/main.py` | 1 hour | Steps 5+7 |
| 9 | `app/streamlit_app.py` | 2-3 hours | Step 8 |
| 10 | `Dockerfile` + `docker-compose.yml` | 30 min | Step 8 |
| 11 | `README.md` | 1 hour | All |
**Total estimated time: 10-12 hours of focused work.**
---
## Resume Bullets (Draft)
> **Aspect-Based Review Intelligence System** | PyTorch, RoBERTa, FAISS, FastAPI, Streamlit
> GitHub | Live Demo
>
> • Fine-tuned RoBERTa on the SemEval-2014 benchmark for aspect-based sentiment analysis, achieving [X]% macro F1 across 5 aspect categories (food, service, ambience, price, anecdotes/miscellaneous), outperforming a TF-IDF + Logistic Regression baseline by [Y]%.
>
> • Built a RAG-powered Q&A layer using sentence-transformers and FAISS over 3,000+ annotated reviews, enabling natural-language queries like "why do customers complain about service?" with LLM-synthesized answers.
>
> • Deployed as a full-stack application with a FastAPI backend and Streamlit dashboard featuring real-time aspect sentiment analysis, aggregate heatmaps, and a conversational Q&A interface.
---
## Interview Questions You Must Prepare For
| Question | What They're Testing |
|---|---|
| Why RoBERTa over BERT? | Do you understand model differences? (RoBERTa = better training, no NSP, more data) |
| Why sentence-pair format for ABSA? | Do you understand how to reformulate tasks for transformers? |
| What's the difference between aspect term extraction and aspect category detection? | NLP depth |
| How would you handle aspects not in the 5 categories? | Can you think beyond the training data? |
| Why FAISS over ChromaDB? | Do you understand trade-offs? (FAISS = speed, Chroma = ease) |
| How do you handle reviews with conflicting sentiments? | The "conflict" label – do you understand it? |