parthnuwal7 committed
Commit f71c767 · 1 Parent(s): ee7ef03

ROLLBACK: Restore PyABSA approach for high accuracy


Restored PyABSA implementation in data_processor.py
Updated requirements for ML dependencies (torch, transformers, pyabsa)
Fixed requirements-docker.txt for HF Spaces deployment
Updated secrets template for PyABSA approach
Created comprehensive PyABSA deployment guide

Rationale: Higher accuracy needed than the HF Inference API transformers approach
Strategy: HF Spaces backend + Streamlit Cloud frontend
Next: Deploy backend to HF Spaces with PyABSA models

.streamlit/secrets.toml.template CHANGED
@@ -1,11 +1,10 @@
 # Streamlit Cloud Secrets Template
-# Copy this to your Streamlit Cloud app secrets
+# Copy this to your Streamlit Cloud app secrets if needed

-# Hugging Face API Token for Inference API
-HF_TOKEN = "hf_your_token_here"
+# Optional: Hugging Face token for additional model downloads
+# HF_TOKEN = "hf_your_token_here"

 # Instructions:
-# 1. Get your HF token from https://huggingface.co/settings/tokens
-# 2. Create a token with "Read" permissions
-# 3. Copy the token and paste it above
-# 4. In Streamlit Cloud: Go to app settings > Secrets > Paste this content
+# - PyABSA models will be downloaded automatically on first run
+# - HF_TOKEN is optional and only needed for restricted models
+# - Leave this file empty if no special tokens are needed
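The optional-token behaviour the template describes could be sketched with a small helper (illustrative only; `get_optional_hf_token` is not part of this commit):

```python
import os

def get_optional_hf_token(secrets=None):
    """Return an HF token if one is configured, else None.

    PyABSA downloads its checkpoints anonymously, so a missing token is
    fine; a token is only needed for gated or private models.
    """
    if secrets:
        token = secrets.get("HF_TOKEN")
        if token:
            return token
    return os.getenv("HF_TOKEN")  # None when unset
```

In the Streamlit app this might be called with `st.secrets` as the `secrets` argument.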
 
PYABSA_DEPLOYMENT.md ADDED
@@ -0,0 +1,195 @@
+# 🚀 PyABSA Deployment Guide: HF Spaces + Streamlit Cloud
+
+## Overview
+
+This guide covers deploying the high-accuracy PyABSA sentiment analysis application using a hybrid approach:
+- **Backend**: HF Spaces (Docker) for PyABSA processing
+- **Frontend**: Streamlit Cloud for the user interface
+
+## Why This Approach?
+
+✅ **High Accuracy**: PyABSA provides superior sentiment analysis compared to API-based solutions
+✅ **Reliability**: Local model processing eliminates API dependencies
+✅ **Scalability**: HF Spaces handles the heavy ML workload
+✅ **User Experience**: Streamlit Cloud provides fast frontend deployment
+
+## Architecture
+
+```
+User → Streamlit Cloud (Frontend) → HF Spaces (PyABSA Backend) → Results
+```
+
+## Deployment Steps
+
+### Phase 1: Deploy Backend to HF Spaces
+
+1. **Push to HF Spaces Repository**
+   ```bash
+   git push origin main
+   ```
+
+2. **Configure HF Spaces**
+   - Go to your HF Spaces settings
+   - Set the app type to "Docker"
+   - Hardware: CPU Basic (16GB RAM recommended for PyABSA)
+   - Dockerfile: Uses `requirements-docker.txt`
+
+3. **Monitor Deployment**
+   - First deployment takes 10-15 minutes (model downloads)
+   - Watch logs for PyABSA model loading
+   - Verify ABSA functionality works
+
+### Phase 2: Create Streamlit Cloud Frontend
+
+1. **Create Separate Frontend Repository**
+   ```bash
+   # Create a new repo for frontend-only version
+   git clone https://github.com/yourusername/your-repo.git frontend-app
+   cd frontend-app
+   ```
+
+2. **Modify for API Connection**
+   - Update `app_enhanced.py` to connect to HF Spaces backend
+   - Replace local processing with API calls to HF Spaces
+   - Keep all visualizations and UI components
+
+3. **Deploy to Streamlit Cloud**
+   - Connect GitHub repository
+   - Use lightweight `requirements.txt` (no PyABSA/torch)
+   - Set environment variables for HF Spaces API endpoint
+
+## Configuration Files
+
+### HF Spaces Configuration
+
+**`requirements-docker.txt`** (Heavy ML dependencies):
+```
+torch>=2.0.0,<2.2.0
+transformers>=4.30.0,<4.37.0
+pyabsa>=2.4.0,<3.0.0
+sentencepiece>=0.1.99
+sacremoses>=0.0.53
+faiss-cpu>=1.7.4
+# ... other dependencies
+```
+
+**`Dockerfile`** (Optimized for PyABSA):
+- Python 3.11 slim base
+- Proper cache directories for transformers
+- Non-root user for security
+- Port 7860 for HF Spaces
+
+### Streamlit Cloud Configuration
+
+**`requirements.txt`** (Lightweight frontend):
+```
+streamlit>=1.28.0
+pandas>=1.5.0
+plotly>=5.15.0
+requests>=2.31.0
+# No torch/transformers/pyabsa
+```
+
+## Troubleshooting
+
+### Common HF Spaces Issues
+
+1. **Model Download Timeout**
+   - Solution: Use CPU Basic with 16GB RAM
+   - Monitor logs for download progress
+
+2. **Memory Issues**
+   - Solution: Upgrade to better hardware tier
+   - Optimize model loading in data_processor.py
+
+3. **File Upload Issues**
+   - Solution: Check Dockerfile permissions
+   - Ensure data directories are writable
+
+### Common Streamlit Cloud Issues
+
+1. **API Connection Failures**
+   - Verify HF Spaces URL is correct
+   - Check network connectivity
+   - Add retry logic for API calls
+
+2. **Dependency Conflicts**
+   - Keep frontend requirements minimal
+   - Only include UI and API libraries
+
+## Performance Optimization
+
+### HF Spaces Backend
+- Use CPU-optimized PyTorch builds
+- Implement model caching
+- Add request batching for multiple reviews
+
+### Streamlit Cloud Frontend
+- Implement caching for API responses
+- Use progress indicators for long operations
+- Optimize chart rendering
+
+## Monitoring and Maintenance
+
+### Health Checks
+- Monitor HF Spaces uptime
+- Check model loading status
+- Verify API endpoints respond correctly
+
+### Updates
+1. Deploy backend changes to HF Spaces first
+2. Test API compatibility
+3. Update frontend to match new API contract
+4. Deploy frontend changes to Streamlit Cloud
+
+## Cost Considerations
+
+### HF Spaces
+- CPU Basic: ~$0.05/hour when running
+- Automatic shutdown when inactive
+- Pay only for usage
+
+### Streamlit Cloud
+- Community tier: Free
+- No resource limits for frontend-only apps
+
+## Security Notes
+
+- No sensitive data stored in either platform
+- File uploads processed securely
+- No permanent data storage
+- HTTPS encryption end-to-end
+
+## API Contract (Frontend ↔ Backend)
+
+### POST `/process-reviews`
+```json
+{
+  "reviews": ["Review text 1", "Review text 2"],
+  "options": {
+    "translate": true,
+    "extract_aspects": true
+  }
+}
+```
+
+### Response
+```json
+{
+  "processed_data": {...},
+  "absa_details": [...],
+  "analytics": {...}
+}
+```
+
+## Next Steps
+
+1. ✅ Deploy current version to HF Spaces
+2. ⚡ Create frontend-only version for Streamlit Cloud
+3. 🔗 Implement API communication layer
+4. 🚀 Test end-to-end functionality
+5. 📊 Monitor performance and optimize
+
+---
+
+*This deployment strategy provides the best of both worlds: PyABSA's accuracy with cloud-native scalability.*
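The API contract in the guide above implies a thin client on the Streamlit side. A minimal sketch, assuming the guide's `/process-reviews` contract (the `build_payload` and `call_backend` helpers and the backend URL are illustrative, not part of this commit):

```python
import requests

def build_payload(reviews, translate=True, extract_aspects=True):
    """Build the request body for POST /process-reviews per the contract above."""
    return {
        "reviews": list(reviews),
        "options": {"translate": translate, "extract_aspects": extract_aspects},
    }

def call_backend(base_url, reviews, timeout=60):
    """POST reviews to the HF Spaces backend and return the parsed JSON response."""
    resp = requests.post(
        f"{base_url}/process-reviews",
        json=build_payload(reviews),
        timeout=timeout,
    )
    resp.raise_for_status()
    # Expected keys per the contract: processed_data, absa_details, analytics
    return resp.json()
```

The frontend would call `call_backend("https://your-space.hf.space", reviews)` with the actual Space URL, ideally wrapped in the retry logic the troubleshooting section recommends.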
requirements-docker.txt CHANGED
@@ -1,7 +1,7 @@
 # Core ML and NLP Libraries
 torch>=2.0.0,<2.2.0
 transformers>=4.30.0,<4.37.0
-pyabsa>=2.4.0,<3.0.0 # Commented out due to HF Spaces compatibility issues
+pyabsa>=2.4.0,<3.0.0 # Restored for high accuracy ABSA
 sentencepiece>=0.1.99
 sacremoses>=0.0.53
 faiss-cpu>=1.7.4
requirements.txt CHANGED
@@ -1,6 +1,4 @@
-altair
-pandas
-# Streamlit Cloud Requirements - Optimized for API approach
+# Production Streamlit Requirements - PyABSA Enhanced ABSA
 streamlit>=1.28.0
 pandas>=1.5.0
 numpy>=1.24.0
@@ -14,16 +12,15 @@ streamlit-option-menu>=0.3.6
 streamlit-aggrid>=0.3.4
 joblib>=1.3.0
 pillow>=10.0.0
-requests>=2.31.0
-faker>=18.0.0
 networkx>=3.0
 openpyxl>=3.1.0
 reportlab>=4.0.0
+faker>=18.0.0

-# Removed heavy ML dependencies for API approach:
-# - torch (saved ~1GB)
-# - transformers (saved ~500MB)
-# - pyabsa (saved download issues)
-# - sentencepiece, sacremoses, faiss-cpu (not needed)
-
-# Using HF Inference API instead - much more reliable!
+# Enhanced ML Dependencies for High Accuracy ABSA
+torch>=1.13.0
+transformers>=4.30.0
+pyabsa>=2.4.0
+sentencepiece>=0.1.99
+sacremoses>=0.0.53
+faiss-cpu>=1.7.4
src/utils/data_processor.py CHANGED
@@ -14,9 +14,6 @@ import streamlit as st
 from collections import Counter, defaultdict
 from itertools import combinations
 import networkx as nx
-import requests
-import os
-import time

 # Set up logging
 logging.basicConfig(level=logging.INFO)
@@ -82,59 +79,27 @@ class DataValidator:


 class TranslationService:
-    """Handles translation using HF Inference API - much more reliable than local models."""
-
+    """Handles translation from Hindi to English using M2M100."""
+
     def __init__(self):
-        self.api_token = self._get_hf_token()
-        self.translation_model = "facebook/m2m100_418M"
-        self.base_url = "https://api-inference.huggingface.co/models"
-        logger.info("Initialized HF Inference API for translation")
-
-    def _get_hf_token(self) -> Optional[str]:
-        """Get HF token from environment or Streamlit secrets."""
-        try:
-            return st.secrets["HF_TOKEN"]
-        except:
-            pass
-
-        token = os.getenv("HF_TOKEN")
-        if not token:
-            logger.warning("No HF_TOKEN found. Translation will be limited.")
-        return token
-
-    def _call_hf_translation_api(self, text: str, source_lang: str = "hi", target_lang: str = "en") -> str:
-        """Call HF Translation API with fallback."""
-        if not self.api_token:
-            logger.warning("No API token, skipping translation")
-            return text
-
+        self.model = None
+        self.tokenizer = None
+        self._load_model()
+
+    def _load_model(self):
+        """Load M2M100 model for translation."""
         try:
-            headers = {"Authorization": f"Bearer {self.api_token}"}
-            url = f"{self.base_url}/{self.translation_model}"
-
-            # Format input for M2M100
-            payload = {
-                "inputs": text,
-                "parameters": {
-                    "src_lang": source_lang,
-                    "tgt_lang": target_lang
-                }
-            }
-
-            response = requests.post(url, headers=headers, json=payload, timeout=30)
-
-            if response.status_code == 200:
-                result = response.json()
-                if isinstance(result, list) and len(result) > 0:
-                    return result[0].get("translation_text", text)
-
-            logger.warning(f"Translation API failed: {response.status_code}")
-            return text
-
+            from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
+
+            model_name = "facebook/m2m100_418M"
+            self.tokenizer = M2M100Tokenizer.from_pretrained(model_name)
+            self.model = M2M100ForConditionalGeneration.from_pretrained(model_name)
+
+            logger.info("Translation model loaded successfully")
         except Exception as e:
-            logger.error(f"Translation error: {str(e)}")
-            return text
-
+            logger.error(f"Error loading translation model: {str(e)}")
+            st.error(f"Failed to load translation model: {str(e)}")
+
     def detect_language(self, text: str) -> str:
         """Detect language of the text."""
         try:
@@ -142,230 +107,171 @@ class TranslationService:
             return lang
         except:
             return 'unknown'
-
+
     def translate_to_english(self, text: str, source_lang: str = 'hi') -> str:
         """
-        Translate text to English using HF API.
-
+        Translate text to English.
+
         Args:
             text: Text to translate
             source_lang: Source language code
-
+
         Returns:
             Translated text
         """
-        if source_lang == 'en' or source_lang == 'unknown':
+        if not self.model or not self.tokenizer:
             return text
-
-        return self._call_hf_translation_api(text, source_lang, "en")
-
+
+        try:
+            # Set source language
+            self.tokenizer.src_lang = source_lang
+
+            # Encode and translate
+            encoded = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
+
+            # Generate translation
+            generated_tokens = self.model.generate(
+                **encoded,
+                forced_bos_token_id=self.tokenizer.get_lang_id("en"),
+                max_length=512,
+                num_beams=2,
+                early_stopping=True
+            )
+
+            # Decode translation
+            translation = self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
+            return translation.strip()
+
+        except Exception as e:
+            logger.error(f"Translation error: {str(e)}")
+            return text
+
     def process_reviews(self, reviews: List[str]) -> Tuple[List[str], List[str]]:
         """
         Process list of reviews for translation.
-
+
         Args:
             reviews: List of review texts
-
+
         Returns:
             Tuple of (translated_reviews, detected_languages)
         """
         translated_reviews = []
         detected_languages = []
-
-        for i, review in enumerate(reviews):
-            if i % 20 == 0:  # Progress logging
-                logger.info(f"Processing translation {i+1}/{len(reviews)}")
-
+
+        for review in reviews:
            lang = self.detect_language(review)
            detected_languages.append(lang)
-
+
            if lang == 'hi':  # Hindi detected
                translated = self.translate_to_english(review, 'hi')
                translated_reviews.append(translated)
            else:
                translated_reviews.append(review)  # Keep original if not Hindi
-
+
        return translated_reviews, detected_languages


 class ABSAProcessor:
-    """Enhanced ABSA using Hugging Face Inference API - much more reliable for production."""
-
+    """Handles Aspect-Based Sentiment Analysis using pyABSA."""
+
     def __init__(self):
-        self.api_token = self._get_hf_token()
-        self.sentiment_model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
-        self.base_url = "https://api-inference.huggingface.co/models"
-        logger.info("Initialized HF Inference API for ABSA processing")
-
-    def _get_hf_token(self) -> Optional[str]:
-        """Get HF token from environment or Streamlit secrets."""
-        # Try Streamlit secrets first
+        self.aspect_extractor = None
+        self._load_model()
+
+    def _load_model(self):
+        """Load pyABSA model with fallback error handling."""
         try:
-            return st.secrets["HF_TOKEN"]
-        except:
-            pass
-
-        # Try environment variable
-        token = os.getenv("HF_TOKEN")
-        if not token:
-            logger.warning("No HF_TOKEN found. Some features may be limited.")
-        return token
-
-    def _call_hf_api(self, model_name: str, inputs: str, max_retries: int = 3) -> Dict:
-        """Call HF Inference API with retry logic."""
-        headers = {}
-        if self.api_token:
-            headers["Authorization"] = f"Bearer {self.api_token}"
-
-        url = f"{self.base_url}/{model_name}"
-        payload = {"inputs": inputs}
-
-        for attempt in range(max_retries):
-            try:
-                response = requests.post(url, headers=headers, json=payload, timeout=30)
-
-                if response.status_code == 503:
-                    # Model is loading, wait and retry
-                    wait_time = 2 ** attempt  # Exponential backoff
-                    logger.info(f"Model loading, waiting {wait_time}s before retry...")
-                    time.sleep(wait_time)
-                    continue
-
-                response.raise_for_status()
-                return response.json()
-
-            except requests.exceptions.RequestException as e:
-                logger.error(f"API call failed (attempt {attempt + 1}): {str(e)}")
-                if attempt == max_retries - 1:
-                    return {"error": str(e)}
-                time.sleep(1)
-
-        return {"error": "Max retries exceeded"}
-
+            # Import inside try block to catch any import-time type errors
+            import pyabsa
+            from pyabsa import ATEPCCheckpointManager
+
+            # Try multiple checkpoint options in order of preference
+            checkpoint_options = [
+                'multilingual',
+                'multilingual2',
+                None  # Let pyABSA use default
+            ]
+
+            for checkpoint in checkpoint_options:
+                try:
+                    logger.info(f"Attempting to load ABSA checkpoint: {checkpoint}")
+
+                    if checkpoint is None:
+                        # Try without specifying checkpoint
+                        self.aspect_extractor = ATEPCCheckpointManager.get_aspect_extractor(
+                            auto_device=True,
+                            task_code='ATEPC'
+                        )
+                    else:
+                        self.aspect_extractor = ATEPCCheckpointManager.get_aspect_extractor(
+                            checkpoint=checkpoint,
+                            auto_device=True,
+                            task_code='ATEPC'
+                        )
+
+                    logger.info(f"ABSA model loaded successfully with checkpoint: {checkpoint}")
+                    return  # Success, exit the method
+
+                except Exception as e:
+                    logger.warning(f"Failed to load checkpoint '{checkpoint}': {str(e)}")
+                    continue  # Try next checkpoint
+
+            # If all checkpoints failed
+            logger.error("All ABSA checkpoint options failed")
+            self.aspect_extractor = None
+
+        except ImportError as e:
+            logger.error(f"pyABSA library not available: {str(e)}")
+            st.warning("⚠️ ABSA functionality unavailable. Advanced sentiment analysis will be limited.")
+            self.aspect_extractor = None
+        except TypeError as e:
+            # Handle Python version compatibility issues
+            logger.error(f"Type compatibility error in pyABSA: {str(e)}")
+            st.warning("⚠️ ABSA model incompatible with current Python version. Using fallback sentiment analysis.")
+            self.aspect_extractor = None
+        except Exception as e:
+            logger.error(f"Error loading ABSA model: {str(e)}")
+            st.warning(f"⚠️ Could not load ABSA model: {str(e)[:100]}... Using basic sentiment analysis.")
+            self.aspect_extractor = None
+
     def extract_aspects_and_sentiments(self, reviews: List[str]) -> List[Dict[str, Any]]:
         """
-        Extract aspects and sentiments using HF Inference API and rule-based aspects.
-
+        Extract aspects and sentiments from reviews.
+
         Args:
             reviews: List of review texts
-
+
         Returns:
             List of dictionaries containing extracted information
         """
-        logger.info(f"Processing {len(reviews)} reviews with HF Inference API")
-
-        processed_results = []
-
-        for i, review in enumerate(reviews):
-            if i % 10 == 0:  # Progress logging
-                logger.info(f"Processing review {i+1}/{len(reviews)}")
-
-            # Get sentiment from HF API
-            sentiment = self._get_hf_sentiment(review)
-
-            # Extract aspects using rule-based approach
-            aspects = self._extract_simple_aspects(review)
-
-            processed_result = {
-                'sentence': review,
-                'aspects': aspects,
-                'sentiments': [sentiment] * len(aspects),
-                'positions': [[0, len(review)]] * len(aspects),
-                'confidence_scores': [0.8] * len(aspects),  # HF models are quite confident
-                'tokens': review.split(),
-                'iob_tags': ['O'] * len(review.split())
-            }
-            processed_results.append(processed_result)
-
-        logger.info(f"Successfully processed {len(processed_results)} reviews")
-        return processed_results
-
-    def _get_hf_sentiment(self, text: str) -> str:
-        """Get sentiment from HF Inference API with fallback."""
-        if not self.api_token:
-            # Fallback to rule-based if no API token
-            return self._get_rule_based_sentiment(text)
-
+        if not self.aspect_extractor:
+            logger.warning("ABSA model not available, returning empty results")
+            return []
+
         try:
-            result = self._call_hf_api(self.sentiment_model, text)
-
-            if "error" in result:
-                logger.warning(f"API error, using rule-based fallback: {result['error']}")
-                return self._get_rule_based_sentiment(text)
-
-            # Parse HF sentiment result
-            if isinstance(result, list) and len(result) > 0:
-                predictions = result[0]
-                if isinstance(predictions, list) and len(predictions) > 0:
-                    top_prediction = max(predictions, key=lambda x: x.get('score', 0))
-                    label = top_prediction.get('label', 'NEUTRAL').upper()
-
-                    # Map HF labels to our format
-                    if 'POSITIVE' in label or 'POS' in label:
-                        return 'Positive'
-                    elif 'NEGATIVE' in label or 'NEG' in label:
-                        return 'Negative'
-                    else:
-                        return 'Neutral'
-
-            # Fallback if parsing fails
-            return self._get_rule_based_sentiment(text)
-
+            results = self.aspect_extractor.extract_aspect(
+                reviews,
+                pred_sentiment=True
+            )
+
+            processed_results = []
+            for result in results:
+                processed_result = {
+                    'sentence': result['sentence'],
+                    'aspects': result.get('aspect', []),
+                    'sentiments': result.get('sentiment', []),
+                    'positions': result.get('position', []),
+                    'confidence_scores': result.get('confidence', []),
+                    'tokens': result.get('tokens', []),
+                    'iob_tags': result.get('IOB', [])
+                }
+                processed_results.append(processed_result)
+
+            return processed_results
         except Exception as e:
-            logger.error(f"HF API error: {str(e)}, using rule-based fallback")
-            return self._get_rule_based_sentiment(text)
-
-    def _get_rule_based_sentiment(self, review: str) -> str:
-        """Fallback rule-based sentiment analysis."""
-        review_lower = review.lower()
-
-        # Enhanced sentiment words
-        positive_words = ['good', 'great', 'excellent', 'amazing', 'love', 'best', 'awesome',
-                          'fantastic', 'wonderful', 'perfect', 'satisfied', 'happy', 'pleased',
-                          'outstanding', 'brilliant', 'superb', 'delighted', 'impressed']
-
-        negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst', 'horrible', 'poor',
-                          'disappointing', 'frustrated', 'angry', 'broken', 'failed', 'useless',
-                          'pathetic', 'disgusting', 'annoying', 'waste', 'regret']
-
-        pos_count = sum(1 for word in positive_words if word in review_lower)
-        neg_count = sum(1 for word in negative_words if word in review_lower)
-
-        if pos_count > neg_count:
-            return 'Positive'
-        elif neg_count > pos_count:
-            return 'Negative'
-        else:
-            return 'Neutral'
-
-    def _extract_simple_aspects(self, review: str) -> List[str]:
-        """Extract aspects using enhanced keyword matching."""
-        review_lower = review.lower()
-        aspects = []
-
-        # Enhanced aspect keywords
-        aspect_keywords = {
-            'Quality': ['quality', 'build', 'material', 'construction', 'durability', 'solid', 'sturdy', 'cheap', 'flimsy'],
-            'Price': ['price', 'cost', 'expensive', 'cheap', 'value', 'money', 'affordable', 'budget', 'worth'],
-            'Service': ['service', 'support', 'help', 'staff', 'customer', 'response', 'team', 'representative'],
-            'Delivery': ['delivery', 'shipping', 'fast', 'quick', 'slow', 'delayed', 'arrive', 'package'],
-            'Design': ['design', 'look', 'appearance', 'beautiful', 'ugly', 'style', 'color', 'aesthetic'],
-            'Performance': ['performance', 'speed', 'fast', 'slow', 'efficiency', 'works', 'function', 'smooth'],
-            'Usability': ['easy', 'difficult', 'user', 'interface', 'intuitive', 'complex', 'simple', 'confusing'],
-            'Features': ['feature', 'function', 'capability', 'option', 'setting', 'mode', 'tool'],
-            'Size': ['size', 'big', 'small', 'large', 'compact', 'tiny', 'huge', 'dimension'],
-            'Battery': ['battery', 'charge', 'power', 'energy', 'last', 'drain', 'life']
-        }
-
-        for aspect, keywords in aspect_keywords.items():
-            if any(keyword in review_lower for keyword in keywords):
-                aspects.append(aspect)
-
-        # Default aspect if none found
-        if not aspects:
-            aspects = ['General']
-
-        return aspects
+            logger.error(f"ABSA extraction error: {str(e)}")
+            return []


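Downstream analytics consume the `processed_results` dictionaries that `ABSAProcessor.extract_aspects_and_sentiments` returns. One plausible aggregation, sketched here for illustration (the `summarize_aspects` helper is not part of this commit), counts sentiment labels per extracted aspect:

```python
from collections import Counter, defaultdict

def summarize_aspects(processed_results):
    """Count sentiment labels per aspect.

    Each item follows the shape produced by extract_aspects_and_sentiments:
    {'sentence': ..., 'aspects': [...], 'sentiments': [...], ...}
    where aspects[i] is paired with sentiments[i].
    """
    summary = defaultdict(Counter)
    for item in processed_results:
        for aspect, sentiment in zip(item.get("aspects", []), item.get("sentiments", [])):
            summary[aspect][sentiment] += 1
    # Plain dicts are easier to feed into pandas/plotly on the frontend
    return {aspect: dict(counts) for aspect, counts in summary.items()}
```

Feeding this into a DataFrame would give the per-aspect sentiment breakdown the dashboard visualizes.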