# Aspect-Based Review Intelligence System – Complete Build Plan
> **Goal:** Fine-tune RoBERTa for aspect-based sentiment analysis + RAG Q&A layer + Streamlit dashboard.
> **Resume line:** NLP + Transformer Fine-tuning + RAG + FastAPI + Deployed
---
## Project Architecture
```mermaid
graph LR
A[Raw Reviews] --> B[Data Processing]
B --> C[Fine-tune RoBERTa]
C --> D[ABSA Model]
B --> E[Embed Reviews]
E --> F[FAISS Vector Store]
D --> G[FastAPI Backend]
F --> G
G --> H[Streamlit Dashboard]
H --> I1[Aspect Sentiment Heatmap]
H --> I2[Trend Charts]
H --> I3[Natural Language Q&A]
```
---
## What the System Does
```
INPUT: "The pizza was amazing but the waiter was incredibly rude and slow"
OUTPUT (ABSA Model):
├── food → Positive (confidence: 0.94)
├── service → Negative (confidence: 0.91)
├── ambience → No mention
└── price → No mention
OUTPUT (RAG Q&A):
User: "Why do customers complain about service?"
System: "Based on 847 reviews, the top service complaints are:
1. Slow wait times (mentioned in 34% of negative reviews)
2. Rude staff behavior (28%)
3. Order mistakes (19%)"
```
---
## Dataset: SemEval 2014 Task 4
| Detail | Value |
|---|---|
| **Name** | SemEval-2014 Task 4: Aspect-Based Sentiment Analysis |
| **Domain** | Restaurant reviews (a Laptop domain also exists – use Restaurant) |
| **Size** | ~3,000 training sentences, ~800 test sentences |
| **Labels** | Aspect categories: `food`, `service`, `ambience`, `price`, `anecdotes/miscellaneous` |
| **Sentiments** | `positive`, `negative`, `neutral`, `conflict` |
| **Format** | XML |
| **Why this dataset** | Standard academic benchmark. Any interviewer who knows NLP will recognize it. Your results are directly comparable to published papers. |
> [!IMPORTANT]
> **Download link:** [SemEval 2014 Task 4 Dataset](https://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools)
> Download the Restaurant train + test XML files.
---
## Complete File Structure
```
review-intelligence/
│
├── data/
│   ├── raw/                      # SemEval XML files go here
│   │   ├── Restaurants_Train_v2.xml
│   │   └── Restaurants_Test_Gold.xml
│   └── processed/                # Generated CSVs
│       ├── train.csv
│       └── test.csv
│
├── src/
│   ├── data_processing.py        # Step 1: Parse XML → CSV
│   ├── train_absa.py             # Step 2: Fine-tune RoBERTa
│   ├── evaluate.py               # Step 3: Evaluate model
│   ├── inference.py              # Step 4: Single-review prediction
│   ├── build_vectorstore.py      # Step 5: Embed reviews → FAISS
│   └── rag_engine.py             # Step 6: RAG retrieval + LLM answer
│
├── api/
│   └── main.py                   # Step 7: FastAPI backend
│
├── app/
│   └── streamlit_app.py          # Step 8: Dashboard
│
├── models/                       # Saved fine-tuned model
├── vectorstore/                  # FAISS index files
│
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── .env                          # API keys (Groq/OpenAI)
└── README.md
```
**Total files to write: 12** (8 Python + 4 config/docs)
---
## Step-by-Step Implementation
---
### Step 1: `src/data_processing.py` – Parse SemEval XML to CSV
**What it does:** Reads the XML format, extracts each sentence + aspect category + sentiment, outputs a clean CSV.
**Input XML format:**
```xml
<sentence id="1">
<text>The pizza was amazing but the waiter was rude.</text>
<aspectCategories>
<aspectCategory category="food" polarity="positive"/>
<aspectCategory category="service" polarity="negative"/>
</aspectCategories>
</sentence>
```
**Output CSV format:**
| text | aspect | sentiment |
|---|---|---|
| The pizza was amazing but the waiter was rude. | food | positive |
| The pizza was amazing but the waiter was rude. | service | negative |
**Key logic:**
```python
import xml.etree.ElementTree as ET
import pandas as pd

def parse_semeval_xml(xml_path):
    """Parse a SemEval-2014 XML file into one row per (sentence, aspect) pair."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    rows = []
    for sentence in root.findall('.//sentence'):
        text = sentence.find('text').text
        # A sentence may carry several aspect categories, each with its own polarity
        for aspect_cat in sentence.findall('.//aspectCategory'):
            rows.append({
                'text': text,
                'aspect': aspect_cat.get('category'),
                'sentiment': aspect_cat.get('polarity')
            })
    return pd.DataFrame(rows)
```
**Label mapping:**
```python
SENTIMENT_MAP = {'positive': 0, 'negative': 1, 'neutral': 2, 'conflict': 3}
ASPECT_CATEGORIES = ['food', 'service', 'ambience', 'price', 'anecdotes/miscellaneous']  # note: the dataset spells it "ambience"
```
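To wire it together, a minimal driver (a sketch, assuming the paths from the file structure above) parses both files and writes the processed CSVs:
```python
# Sketch: parse both SemEval files and write the CSVs the rest of the pipeline reads.
if __name__ == '__main__':
    train_df = parse_semeval_xml('data/raw/Restaurants_Train_v2.xml')
    test_df = parse_semeval_xml('data/raw/Restaurants_Test_Gold.xml')
    train_df.to_csv('data/processed/train.csv', index=False)
    test_df.to_csv('data/processed/test.csv', index=False)
    print(f'Wrote {len(train_df)} train rows, {len(test_df)} test rows')
```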
---
### Step 2: `src/train_absa.py` – Fine-tune RoBERTa
**What it does:** Fine-tunes `roberta-base` for aspect-based sentiment classification.
**Model approach:** Auxiliary Sentence Pair Classification
- Input to model: the review text as sentence A and the aspect name as sentence B, e.g. `("The pizza was amazing but waiter was rude", "food")` (RoBERTa's tokenizer joins pairs with `<s>`/`</s>` markers, its equivalent of BERT's `[CLS]`/`[SEP]`)
- Output: `positive` (for the "food" aspect)
- This converts ABSA into a standard sentence-pair classification task that RoBERTa handles natively.
**Key implementation details:**
```python
# Tokenization – sentence-pair format
# Sentence A = review text
# Sentence B = aspect category name
inputs = tokenizer(
review_text, # "The pizza was amazing..."
aspect_category, # "food"
truncation=True,
padding='max_length',
max_length=128,
return_tensors='pt'
)
```
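To feed pairs like this through the HuggingFace `Trainer`, one option (a sketch; it assumes `tokenizer`, the `train_df` CSV from Step 1, and `SENTIMENT_MAP`) is to wrap the DataFrame in a `datasets.Dataset` and tokenize with `map`:
```python
from datasets import Dataset

def tokenize_batch(batch):
    # Sentence A = review text, sentence B = aspect category name
    enc = tokenizer(batch['text'], batch['aspect'],
                    truncation=True, padding='max_length', max_length=128)
    enc['labels'] = [SENTIMENT_MAP[s] for s in batch['sentiment']]
    return enc

train_dataset = Dataset.from_pandas(train_df).map(tokenize_batch, batched=True)
```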
**Training config:**
| Parameter | Value | Why |
|---|---|---|
| Base model | `roberta-base` | Best balance of size vs accuracy for this task |
| Learning rate | `2e-5` | Standard for transformer fine-tuning |
| Batch size | `16` | Fits in ~6GB GPU / free Colab |
| Epochs | `5` | SemEval is small; more epochs = overfitting |
| Max length | `128` | Restaurant reviews are short |
| Optimizer | AdamW | Standard for transformers |
| Scheduler | Linear warmup (10% steps) | Prevents early instability |
| Loss | CrossEntropyLoss | 4-class classification |
**Libraries needed:**
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
```
**Training loop (using HuggingFace Trainer):**
```python
model = RobertaForSequenceClassification.from_pretrained(
'roberta-base', num_labels=4 # pos, neg, neutral, conflict
)
training_args = TrainingArguments(
output_dir='./models/absa-roberta',
num_train_epochs=5,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
learning_rate=2e-5,
warmup_ratio=0.1,
weight_decay=0.01,
    evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers releases
save_strategy='epoch',
load_best_model_at_end=True,
metric_for_best_model='f1_macro',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
compute_metrics=compute_metrics,
)
trainer.train()
```
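`metric_for_best_model='f1_macro'` only works if `compute_metrics` returns a key with that name. A minimal version:
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # Trainer passes (logits, labels) for the eval set
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1_macro': f1_score(labels, preds, average='macro'),
    }
```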
**Expected results on SemEval 2014 Restaurant:**
- Accuracy: ~83-87%
- Macro F1: ~75-80%
- These are competitive with published baselines (~85% accuracy)
> [!TIP]
> **Where to train:** Use Google Colab (free T4 GPU). Training takes ~15-20 minutes for 5 epochs on SemEval-sized data.
---
### Step 3: `src/evaluate.py` – Evaluation & Metrics
**What it does:** Generates classification report, confusion matrix, per-aspect performance.
**Key metrics to compute:**
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Overall metrics
print(classification_report(y_true, y_pred,
      target_names=['positive', 'negative', 'neutral', 'conflict']))

# Per-aspect accuracy
for aspect in ASPECT_CATEGORIES:
    mask = (test_df['aspect'] == aspect).values
    aspect_acc = accuracy_score(y_true[mask], y_pred[mask])
    print(f"{aspect}: {aspect_acc:.2%}")
```
**What to save for resume:**
- Overall accuracy and macro F1
- Per-aspect F1 (shows where model is strong/weak)
- Comparison vs. a baseline (e.g., TF-IDF + Logistic Regression; see the sketch below)
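A quick way to get that baseline number (a sketch over the Step 1 CSVs; concatenating the aspect onto the text lets one classifier handle all five aspects):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Baseline: TF-IDF over "text + aspect" strings, then logistic regression
X_train = train_df['text'] + ' ' + train_df['aspect']
X_test = test_df['text'] + ' ' + test_df['aspect']

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                         LogisticRegression(max_iter=1000))
baseline.fit(X_train, train_df['sentiment'])
preds = baseline.predict(X_test)
print('Baseline macro F1:', f1_score(test_df['sentiment'], preds, average='macro'))
```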
---
### Step 4: `src/inference.py` – Single Review Prediction
**What it does:** Takes one review, runs it through the model for ALL aspects, returns structured output.
```python
import torch

# Inverse of SENTIMENT_MAP: {0: 'positive', 1: 'negative', 2: 'neutral', 3: 'conflict'}
LABELS = {v: k for k, v in SENTIMENT_MAP.items()}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def predict_aspects(review_text: str, model, tokenizer):
    """Run the model once per aspect category; keep only confident predictions."""
    results = {}
    model.to(device)
    model.eval()
    for aspect in ASPECT_CATEGORIES:
        inputs = tokenizer(review_text, aspect,
                           truncation=True, padding='max_length',
                           max_length=128, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**inputs.to(device))
        probs = torch.softmax(outputs.logits, dim=1)
        pred_label = torch.argmax(probs).item()
        confidence = probs[0][pred_label].item()
        # Below the threshold, treat the aspect as not mentioned
        if confidence > 0.6:
            results[aspect] = {
                'sentiment': LABELS[pred_label],
                'confidence': round(confidence, 3)
            }
    return results
```
**Example output:**
```json
{
"food": {"sentiment": "positive", "confidence": 0.94},
"service": {"sentiment": "negative", "confidence": 0.91}
}
```
---
### Step 5: `src/build_vectorstore.py` – Embed Reviews into FAISS
**What it does:** Takes all reviews, generates sentence embeddings, stores in FAISS for RAG retrieval.
```python
from sentence_transformers import SentenceTransformer
import faiss
import os
import pickle

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all reviews (dedup first: each review appears once per aspect row)
review_texts = df['text'].unique().tolist()
embeddings = embedder.encode(review_texts, show_progress_bar=True)

# Build FAISS index; inner product over L2-normalized vectors = cosine similarity
dimension = embeddings.shape[1]  # 384 for MiniLM
index = faiss.IndexFlatIP(dimension)
embeddings = embeddings.astype('float32')  # FAISS requires float32
faiss.normalize_L2(embeddings)
index.add(embeddings)

# Save the index and the texts it points back to
os.makedirs('vectorstore', exist_ok=True)
faiss.write_index(index, 'vectorstore/reviews.index')
with open('vectorstore/review_texts.pkl', 'wb') as f:
    pickle.dump(review_texts, f)
```
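Step 6 uses `embedder`, `index`, and `review_texts` without redefining them; `rag_engine.py` can restore them at import time (a sketch using the paths above):
```python
from sentence_transformers import SentenceTransformer
import faiss
import pickle

# Reload the artifacts written by build_vectorstore.py
embedder = SentenceTransformer('all-MiniLM-L6-v2')
index = faiss.read_index('vectorstore/reviews.index')
with open('vectorstore/review_texts.pkl', 'rb') as f:
    review_texts = pickle.load(f)
```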
---
### Step 6: `src/rag_engine.py` – RAG Q&A Engine
**What it does:** User asks a question β†’ retrieves relevant reviews β†’ LLM synthesizes answer.
```python
def answer_question(question: str, top_k: int = 10):
# 1. Embed the question
q_embedding = embedder.encode([question])
faiss.normalize_L2(q_embedding)
# 2. Search FAISS
scores, indices = index.search(q_embedding.astype('float32'), top_k)
retrieved_reviews = [review_texts[i] for i in indices[0]]
# 3. Run ABSA on each retrieved review
aspect_results = []
for review in retrieved_reviews:
aspects = predict_aspects(review, model, tokenizer)
aspect_results.append({'text': review, 'aspects': aspects})
# 4. Send to LLM for synthesis
context = "\n".join([
f"Review: {r['text']}\nAspects: {r['aspects']}"
for r in aspect_results
])
prompt = f"""Based on these customer reviews and their aspect sentiments:
{context}
Question: {question}
Provide a concise, data-backed answer with specific counts and percentages."""
# Call Groq/OpenAI
response = llm.chat.completions.create(
model="llama-3.1-8b-instant", # or gpt-3.5-turbo
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
```
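The `llm` client in the snippet is created once at module level. With Groq (the key comes from `.env` via `python-dotenv`), for example:
```python
import os
from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads GROQ_API_KEY from .env
llm = Groq(api_key=os.environ['GROQ_API_KEY'])
```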
---
### Step 7: `api/main.py` – FastAPI Backend
**Endpoints:**
| Endpoint | Method | What It Does |
|---|---|---|
| `/predict` | POST | Takes a review β†’ returns aspect sentiments |
| `/ask` | POST | Takes a question β†’ returns RAG answer |
| `/stats` | GET | Returns aggregate sentiment stats per aspect |
| `/health` | GET | Health check |
```python
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="Review Intelligence API")
class ReviewInput(BaseModel):
text: str
class QuestionInput(BaseModel):
question: str
@app.post("/predict")
def predict(review: ReviewInput):
results = predict_aspects(review.text, model, tokenizer)
return {"review": review.text, "aspects": results}
@app.post("/ask")
def ask(q: QuestionInput):
answer = answer_question(q.question)
return {"question": q.question, "answer": answer}
@app.get("/stats")
def get_stats():
# Return pre-computed aggregate stats
return aggregate_sentiment_stats
```
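Start the server with `uvicorn api.main:app --reload` and smoke-test it from Python. This sketch uses `requests`, which is not in requirements.txt (install it separately or swap in `httpx`):
```python
import requests

BASE = 'http://localhost:8000'

# Aspect sentiments for one review
r = requests.post(f'{BASE}/predict',
                  json={'text': 'The pizza was amazing but the waiter was rude.'})
print(r.json())

# RAG question answering
r = requests.post(f'{BASE}/ask',
                  json={'question': 'Why do customers complain about service?'})
print(r.json()['answer'])
```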
---
### Step 8: `app/streamlit_app.py` – Dashboard
**3 main panels:**
**Panel 1 β€” Live Analysis:**
- Text input: paste any review
- Click "Analyze" β†’ shows aspect sentiment cards with color coding
- Green = positive, Red = negative, Gray = neutral
**Panel 2 β€” Aggregate Dashboard:**
- Aspect sentiment heatmap (aspects × sentiment, color intensity = count)
- Trend chart showing sentiment over time per aspect (if reviews have timestamps)
- Bar chart: "Top 5 complaints" and "Top 5 praises"
**Panel 3 β€” Q&A:**
- Text input: "What do customers say about food quality?"
- Returns LLM-synthesized answer with source reviews shown below
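A minimal sketch of Panel 1 (the backend URL and the Streamlit color markup are illustrative choices, not fixed by the plan):
```python
import requests
import streamlit as st

st.title('Review Intelligence')

# Panel 1 – Live Analysis: paste a review, render color-coded aspect cards
review = st.text_area('Paste a review')
if st.button('Analyze') and review:
    resp = requests.post('http://localhost:8000/predict', json={'text': review})
    colors = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}
    for aspect, result in resp.json()['aspects'].items():
        color = colors.get(result['sentiment'], 'gray')
        st.markdown(f":{color}[**{aspect}**: {result['sentiment']} "
                    f"({result['confidence']:.0%})]")
```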
---
## requirements.txt
```
torch>=2.0
transformers>=4.35
datasets>=2.14
sentence-transformers>=2.2
faiss-cpu>=1.7
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
fastapi>=0.104
uvicorn>=0.24
streamlit>=1.28
plotly>=5.17
groq>=0.4
python-dotenv>=1.0
```
---
## Build Order (Do This Sequence)
| Step | File | Time Estimate | Dependency |
|---|---|---|---|
| 1 | Download SemEval data | 10 min | None |
| 2 | `src/data_processing.py` | 30 min | Step 1 |
| 3 | `src/train_absa.py` | 2-3 hours | Step 2 |
| 4 | `src/evaluate.py` | 30 min | Step 3 |
| 5 | `src/inference.py` | 30 min | Step 3 |
| 6 | `src/build_vectorstore.py` | 20 min | Step 2 |
| 7 | `src/rag_engine.py` | 1 hour | Steps 5+6 |
| 8 | `api/main.py` | 1 hour | Steps 5+7 |
| 9 | `app/streamlit_app.py` | 2-3 hours | Step 8 |
| 10 | `Dockerfile` + `docker-compose.yml` | 30 min | Step 8 |
| 11 | `README.md` | 1 hour | All |
**Total estimated time: 10-12 hours of focused work.**
---
## Resume Bullets (Draft)
> **Aspect-Based Review Intelligence System** | PyTorch, RoBERTa, FAISS, FastAPI, Streamlit
> GitHub | Live Demo
>
> • Fine-tuned RoBERTa on the SemEval-2014 benchmark for aspect-based sentiment analysis, achieving [X]% macro F1 across 5 aspect categories (food, service, ambience, price, anecdotes/miscellaneous), outperforming a TF-IDF + Logistic Regression baseline by [Y]%.
>
> • Built a RAG-powered Q&A layer using sentence-transformers and FAISS over 3,000+ annotated reviews, enabling natural-language queries like "why do customers complain about service?" with LLM-synthesized answers.
>
> • Deployed as a full-stack application with a FastAPI backend and Streamlit dashboard featuring real-time aspect sentiment analysis, aggregate heatmaps, and a conversational Q&A interface.
---
## Interview Questions You Must Prepare For
| Question | What They're Testing |
|---|---|
| Why RoBERTa over BERT? | Do you understand model differences? (RoBERTa = better training, no NSP, more data) |
| Why sentence-pair format for ABSA? | Do you understand how to reformulate tasks for transformers? |
| What's the difference between aspect term extraction and aspect category detection? | NLP depth |
| How would you handle aspects not in the 5 categories? | Can you think beyond the training data? |
| Why FAISS over ChromaDB? | Do you understand trade-offs? (FAISS = speed, Chroma = ease) |
| How do you handle reviews with conflicting sentiments? | The "conflict" label – do you understand it? |