Spaces:

spamultrapromax
/

BrandScanAI

Running

App Files Files Community

BrandScanAI / README.md

Arun21102003

Deployment preparation (removed binary files)

90fe073 23 days ago

preview code

raw

history blame contribute delete

7.72 kB

	---
	title: BrandScanAI
	emoji: 🔍
	colorFrom: blue
	colorTo: indigo
	sdk: streamlit
	sdk_version: 1.31.0
	app_file: app.py
	pinned: false
	---

	# BrandScanAI: Open-Source Brand Monitoring with LLM Analysis

	## Overview

	BrandScanAI is a comprehensive brand monitoring system that combines web search, content extraction, and AI-powered sentiment analysis to track brand mentions across the internet. Built with Streamlit and powered by open-source LLMs, it provides real-time insights into brand perception and media coverage.

	## Open-Source LLM APIs Explored

	### Primary Implementation: Groq + Llama Models
	- Model: Llama 3.1 8B Instant (via Groq API)
	- Tradeoffs:
	- Speed: ⚡ Extremely fast inference (sub-second response times)
	- Accuracy: 🎯 Good for sentiment analysis and structured extraction
	- Documentation: 📚 Excellent Groq documentation with clear examples
	- Cost: 💰 Very affordable ($0.27/1M tokens for Llama 3.1 8B)
	- Limitations: Smaller context window compared to larger models

	### Alternative Models Considered
	- Llama 3.3 70B: Higher accuracy but slower inference and higher cost
	- Code Llama: Specialized for code analysis but less suitable for general text
	- Mistral 7B: Good balance but Groq's Llama 3.1 8B proved more reliable

	## Technical Challenges & Solutions

	### 1. Web Crawling Challenges
	- Anti-bot measures: Implemented respectful delays (0.5s) and proper User-Agent headers
	- Content extraction: Used Trafilatura for robust article extraction vs. basic BeautifulSoup
	- Rate limiting: Graceful error handling with informative user feedback
	- Dynamic content: Limited JavaScript-heavy sites, focused on static content

	### 2. LLM Querying Issues
	- JSON parsing errors: Enforced `response_format={"type": "json_object"}` in API calls
	- Inconsistent outputs: Implemented structured prompts with explicit JSON schema
	- Context length: Limited article content to 1000 characters for analysis
	- API reliability: Added retry logic and fallback error responses

	### 3. Context Extraction Problems
	- Noise removal: Trafilatura effectively strips ads, navigation, and boilerplate
	- Metadata extraction: Combined Trafilatura metadata with BeautifulSoup fallback
	- Content quality: Implemented content length validation before analysis

	## Scalability & Robustness Improvements

	### Production-Ready Enhancements
	1. Database Integration: SQLite/PostgreSQL for persistent storage and historical analysis
	2. Queue System: Celery/Redis for background processing of large batches
	3. Caching Layer: Redis for API response caching and rate limit management
	4. Monitoring: Prometheus/Grafana for system health and performance tracking
	5. Load Balancing: Multiple worker processes for concurrent analysis
	6. Error Recovery: Retry mechanisms with exponential backoff
	7. API Rate Limiting: Intelligent request throttling across multiple providers

	### Architecture Improvements
	- Microservices: Separate services for search, scraping, and analysis
	- Message Queues: Asynchronous processing for large-scale monitoring
	- CDN Integration: Cached content delivery for faster responses
	- Multi-region Deployment: Geographic distribution for global brand monitoring

	## LLM Comparison: Llama 3.1 8B vs Llama 3.3 70B

	### Test Case: Brand Sentiment Analysis
	Input: "OpenAI's new GPT-4 model shows impressive capabilities but raises concerns about AI safety and job displacement."

	### Llama 3.1 8B Response:
	```json
	{
	"explicit_mentions": [{
	"mention": "OpenAI's new GPT-4 model",
	"sentiment": "positive",
	"explanation": "Shows impressive capabilities"
	}],
	"indirect_mentions": [],
	"overall_sentiment": "neutral"
	}
	```

	### Llama 3.3 70B Response:
	```json
	{
	"explicit_mentions": [{
	"mention": "OpenAI's new GPT-4 model",
	"sentiment": "positive",
	"explanation": "Shows impressive capabilities"
	}],
	"indirect_mentions": [{
	"reference": "AI safety and job displacement",
	"sentiment": "negative",
	"explanation": "Raises concerns about negative impacts"
	}],
	"overall_sentiment": "neutral"
	}
	```

	Key Differences:
	- 3.3 70B: More nuanced analysis, catches indirect negative mentions
	- 3.1 8B: Faster but misses subtle context and indirect references
	- Trade-off: 70B provides better accuracy but 3x slower and 10x more expensive

	## Setup & Installation

	### Prerequisites
	- Python 3.11+
	- API keys for Groq and SerpAPI

	### Installation
	```bash
	# Clone repository
	git clone https://github.com/yourusername/brandscan-ai.git
	cd brandscan-ai

	# Install dependencies
	pip install -r requirements.txt

	# Set up environment variables
	cp .env.example .env
	# Edit .env with your API keys
	```

	### Environment Variables
	```bash
	# Required API Keys
	GROQ_API_KEY=your_groq_api_key_here
	SERPAPI_API_KEY=your_serpapi_key_here

	# Database (optional - defaults to SQLite)
	DATABASE_URL=sqlite:///./brandscan.db
	```

	### API Key Setup
	1. Groq API: Visit [console.groq.com](https://console.groq.com) → Sign up → Get API key
	2. SerpAPI: Visit [serpapi.com](https://serpapi.com) → Sign up → Get API key

	### Running the Application
	```bash
	# Start the Streamlit app
	streamlit run app.py

	# Access at http://localhost:8501
	```

	## Deploying to Hugging Face Spaces

	1. Create a Space: Go to [huggingface.co/new-space](https://huggingface.co/new-space).
	2. Configure:
	- Name: `brandscan-ai`
	- SDK: Streamlit
	- Privacy: Public (or Private)
	3. Upload Files: Upload all project files (except `.venv`, `.env`, and `brandscan.db`). The `.hfignore` file will handle this if you use Git.
	4. Set Secrets: Go to Settings -> Variables and secrets -> New secret:
	- `GROQ_API_KEY`: Your Groq API key
	- `SERPAPI_API_KEY`: Your SerpAPI key
	- `DATABASE_URL`: `sqlite:///./brandscan.db` (Note: SQLite is not persistent on HF Spaces. For persistence, use an external PostgreSQL DB or HF Datasets).
	5. Wait for Build: Hugging Face will automatically build and deploy your app.

	## Usage

	1. Configure Search: Enter search query and brand names
	2. Select Engines: Choose from Google, Bing, DuckDuckGo
	3. Run Analysis: Click "Start Batch Analysis"
	4. View Results: Explore mentions, sentiment, and context
	5. Export Data: Download CSV reports for further analysis

	## Dependencies

	### Core Libraries
	- `streamlit>=1.50.0` - Web application framework
	- `groq>=0.32.0` - Groq API client
	- `serpapi>=0.1.5` - Google Search API
	- `trafilatura>=2.0.0` - Web content extraction
	- `sqlalchemy>=2.0.44` - Database ORM
	- `pandas>=2.3.3` - Data manipulation
	- `plotly>=6.3.1` - Interactive visualizations

	### Optional Dependencies
	- `psycopg2-binary` - PostgreSQL support
	- `PyMySQL` - MySQL support
	- `duckduckgo-search` - DuckDuckGo search
	- `beautifulsoup4` - HTML parsing fallback

	## Features

	- 🔍 Multi-Engine Search: Google, Bing, DuckDuckGo
	- 🤖 AI-Powered Analysis: Sentiment analysis with context
	- 📊 Interactive Dashboard: Real-time analytics and visualizations
	- 💾 Data Export: CSV reports and database storage
	- ⏰ Scheduled Monitoring: Automated recurring analysis
	- 🕸️ Co-Mention Network: Brand relationship visualization
	- 📈 Historical Tracking: Trend analysis over time

	## License

	MIT License - see LICENSE file for details.

	## Contributing

	1. Fork the repository
	2. Create a feature branch
	3. Make your changes
	4. Add tests if applicable
	5. Submit a pull request

	## Support

	For issues and questions:
	- Create an issue on GitHub
	- Check the documentation
	- Review the troubleshooting guide