BrandScanAI / README.md
Arun21102003
Deployment preparation (removed binary files)
90fe073
---
title: BrandScanAI
emoji: πŸ”
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.31.0
app_file: app.py
pinned: false
---
# BrandScanAI: Open-Source Brand Monitoring with LLM Analysis
## Overview
BrandScanAI is a comprehensive brand monitoring system that combines web search, content extraction, and AI-powered sentiment analysis to track brand mentions across the internet. Built with Streamlit and powered by open-source LLMs, it provides real-time insights into brand perception and media coverage.
## Open-Source LLM APIs Explored
### Primary Implementation: Groq + Llama Models
- **Model**: Llama 3.1 8B Instant (via Groq API)
- **Tradeoffs**:
- **Speed**: ⚑ Extremely fast inference (sub-second response times)
- **Accuracy**: 🎯 Good for sentiment analysis and structured extraction
- **Documentation**: πŸ“š Excellent Groq documentation with clear examples
- **Cost**: πŸ’° Very affordable ($0.27/1M tokens for Llama 3.1 8B)
- **Limitations**: Smaller context window compared to larger models
### Alternative Models Considered
- **Llama 3.3 70B**: Higher accuracy but slower inference and higher cost
- **Code Llama**: Specialized for code analysis but less suitable for general text
- **Mistral 7B**: Good balance but Groq's Llama 3.1 8B proved more reliable
## Technical Challenges & Solutions
### 1. Web Crawling Challenges
- **Anti-bot measures**: Implemented respectful delays (0.5s) and proper User-Agent headers
- **Content extraction**: Used Trafilatura for robust article extraction vs. basic BeautifulSoup
- **Rate limiting**: Graceful error handling with informative user feedback
- **Dynamic content**: Limited JavaScript-heavy sites, focused on static content
### 2. LLM Querying Issues
- **JSON parsing errors**: Enforced `response_format={"type": "json_object"}` in API calls
- **Inconsistent outputs**: Implemented structured prompts with explicit JSON schema
- **Context length**: Limited article content to 1000 characters for analysis
- **API reliability**: Added retry logic and fallback error responses
### 3. Context Extraction Problems
- **Noise removal**: Trafilatura effectively strips ads, navigation, and boilerplate
- **Metadata extraction**: Combined Trafilatura metadata with BeautifulSoup fallback
- **Content quality**: Implemented content length validation before analysis
## Scalability & Robustness Improvements
### Production-Ready Enhancements
1. **Database Integration**: SQLite/PostgreSQL for persistent storage and historical analysis
2. **Queue System**: Celery/Redis for background processing of large batches
3. **Caching Layer**: Redis for API response caching and rate limit management
4. **Monitoring**: Prometheus/Grafana for system health and performance tracking
5. **Load Balancing**: Multiple worker processes for concurrent analysis
6. **Error Recovery**: Retry mechanisms with exponential backoff
7. **API Rate Limiting**: Intelligent request throttling across multiple providers
### Architecture Improvements
- **Microservices**: Separate services for search, scraping, and analysis
- **Message Queues**: Asynchronous processing for large-scale monitoring
- **CDN Integration**: Cached content delivery for faster responses
- **Multi-region Deployment**: Geographic distribution for global brand monitoring
## LLM Comparison: Llama 3.1 8B vs Llama 3.3 70B
### Test Case: Brand Sentiment Analysis
**Input**: "OpenAI's new GPT-4 model shows impressive capabilities but raises concerns about AI safety and job displacement."
### Llama 3.1 8B Response:
```json
{
"explicit_mentions": [{
"mention": "OpenAI's new GPT-4 model",
"sentiment": "positive",
"explanation": "Shows impressive capabilities"
}],
"indirect_mentions": [],
"overall_sentiment": "neutral"
}
```
### Llama 3.3 70B Response:
```json
{
"explicit_mentions": [{
"mention": "OpenAI's new GPT-4 model",
"sentiment": "positive",
"explanation": "Shows impressive capabilities"
}],
"indirect_mentions": [{
"reference": "AI safety and job displacement",
"sentiment": "negative",
"explanation": "Raises concerns about negative impacts"
}],
"overall_sentiment": "neutral"
}
```
**Key Differences**:
- **3.3 70B**: More nuanced analysis, catches indirect negative mentions
- **3.1 8B**: Faster but misses subtle context and indirect references
- **Trade-off**: 70B provides better accuracy but 3x slower and 10x more expensive
## Setup & Installation
### Prerequisites
- Python 3.11+
- API keys for Groq and SerpAPI
### Installation
```bash
# Clone repository
git clone https://github.com/yourusername/brandscan-ai.git
cd brandscan-ai
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
```
### Environment Variables
```bash
# Required API Keys
GROQ_API_KEY=your_groq_api_key_here
SERPAPI_API_KEY=your_serpapi_key_here
# Database (optional - defaults to SQLite)
DATABASE_URL=sqlite:///./brandscan.db
```
### API Key Setup
1. **Groq API**: Visit [console.groq.com](https://console.groq.com) β†’ Sign up β†’ Get API key
2. **SerpAPI**: Visit [serpapi.com](https://serpapi.com) β†’ Sign up β†’ Get API key
### Running the Application
```bash
# Start the Streamlit app
streamlit run app.py
# Access at http://localhost:8501
```
## Deploying to Hugging Face Spaces
1. **Create a Space**: Go to [huggingface.co/new-space](https://huggingface.co/new-space).
2. **Configure**:
- **Name**: `brandscan-ai`
- **SDK**: Streamlit
- **Privacy**: Public (or Private)
3. **Upload Files**: Upload all project files (except `.venv`, `.env`, and `brandscan.db`). The `.hfignore` file will handle this if you use Git.
4. **Set Secrets**: Go to **Settings** -> **Variables and secrets** -> **New secret**:
- `GROQ_API_KEY`: Your Groq API key
- `SERPAPI_API_KEY`: Your SerpAPI key
- `DATABASE_URL`: `sqlite:///./brandscan.db` (Note: SQLite is not persistent on HF Spaces. For persistence, use an external PostgreSQL DB or HF Datasets).
5. **Wait for Build**: Hugging Face will automatically build and deploy your app.
## Usage
1. **Configure Search**: Enter search query and brand names
2. **Select Engines**: Choose from Google, Bing, DuckDuckGo
3. **Run Analysis**: Click "Start Batch Analysis"
4. **View Results**: Explore mentions, sentiment, and context
5. **Export Data**: Download CSV reports for further analysis
## Dependencies
### Core Libraries
- `streamlit>=1.50.0` - Web application framework
- `groq>=0.32.0` - Groq API client
- `serpapi>=0.1.5` - Google Search API
- `trafilatura>=2.0.0` - Web content extraction
- `sqlalchemy>=2.0.44` - Database ORM
- `pandas>=2.3.3` - Data manipulation
- `plotly>=6.3.1` - Interactive visualizations
### Optional Dependencies
- `psycopg2-binary` - PostgreSQL support
- `PyMySQL` - MySQL support
- `duckduckgo-search` - DuckDuckGo search
- `beautifulsoup4` - HTML parsing fallback
## Features
- πŸ” **Multi-Engine Search**: Google, Bing, DuckDuckGo
- πŸ€– **AI-Powered Analysis**: Sentiment analysis with context
- πŸ“Š **Interactive Dashboard**: Real-time analytics and visualizations
- πŸ’Ύ **Data Export**: CSV reports and database storage
- ⏰ **Scheduled Monitoring**: Automated recurring analysis
- πŸ•ΈοΈ **Co-Mention Network**: Brand relationship visualization
- πŸ“ˆ **Historical Tracking**: Trend analysis over time
## License
MIT License - see LICENSE file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## Support
For issues and questions:
- Create an issue on GitHub
- Check the documentation
- Review the troubleshooting guide