---
title: BrandScanAI
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.31.0
app_file: app.py
pinned: false
---

# BrandScanAI: Open-Source Brand Monitoring with LLM Analysis

## Overview

BrandScanAI is a comprehensive brand monitoring system that combines web search, content extraction, and AI-powered sentiment analysis to track brand mentions across the internet. Built with Streamlit and powered by open-source LLMs, it provides real-time insights into brand perception and media coverage.

## Open-Source LLM APIs Explored

### Primary Implementation: Groq + Llama Models
- **Model**: Llama 3.1 8B Instant (via Groq API)
- **Tradeoffs**:
  - **Speed**: ⚡ Extremely fast inference (sub-second response times)
  - **Accuracy**: 🎯 Good for sentiment analysis and structured extraction
  - **Documentation**: 📚 Excellent Groq documentation with clear examples
  - **Cost**: 💰 Very affordable ($0.27/1M tokens for Llama 3.1 8B)
  - **Limitations**: Smaller context window compared to larger models

### Alternative Models Considered
- **Llama 3.3 70B**: Higher accuracy but slower inference and higher cost
- **Code Llama**: Specialized for code analysis but less suitable for general text
- **Mistral 7B**: Good balance but Groq's Llama 3.1 8B proved more reliable

## Technical Challenges & Solutions

### 1. Web Crawling Challenges
- **Anti-bot measures**: Implemented respectful delays (0.5s) and proper User-Agent headers
- **Content extraction**: Used Trafilatura for robust article extraction vs. basic BeautifulSoup
- **Rate limiting**: Graceful error handling with informative user feedback
- **Dynamic content**: Limited JavaScript-heavy sites, focused on static content

### 2. LLM Querying Issues
- **JSON parsing errors**: Enforced `response_format={"type": "json_object"}` in API calls
- **Inconsistent outputs**: Implemented structured prompts with explicit JSON schema
- **Context length**: Limited article content to 1000 characters for analysis
- **API reliability**: Added retry logic and fallback error responses

### 3. Context Extraction Problems
- **Noise removal**: Trafilatura effectively strips ads, navigation, and boilerplate
- **Metadata extraction**: Combined Trafilatura metadata with BeautifulSoup fallback
- **Content quality**: Implemented content length validation before analysis

## Scalability & Robustness Improvements

### Production-Ready Enhancements
1. **Database Integration**: SQLite/PostgreSQL for persistent storage and historical analysis
2. **Queue System**: Celery/Redis for background processing of large batches
3. **Caching Layer**: Redis for API response caching and rate limit management
4. **Monitoring**: Prometheus/Grafana for system health and performance tracking
5. **Load Balancing**: Multiple worker processes for concurrent analysis
6. **Error Recovery**: Retry mechanisms with exponential backoff
7. **API Rate Limiting**: Intelligent request throttling across multiple providers

### Architecture Improvements
- **Microservices**: Separate services for search, scraping, and analysis
- **Message Queues**: Asynchronous processing for large-scale monitoring
- **CDN Integration**: Cached content delivery for faster responses
- **Multi-region Deployment**: Geographic distribution for global brand monitoring

## LLM Comparison: Llama 3.1 8B vs Llama 3.3 70B

### Test Case: Brand Sentiment Analysis
**Input**: "OpenAI's new GPT-4 model shows impressive capabilities but raises concerns about AI safety and job displacement."

### Llama 3.1 8B Response:
```json
{
  "explicit_mentions": [{
    "mention": "OpenAI's new GPT-4 model",
    "sentiment": "positive",
    "explanation": "Shows impressive capabilities"
  }],
  "indirect_mentions": [],
  "overall_sentiment": "neutral"
}
```

### Llama 3.3 70B Response:
```json
{
  "explicit_mentions": [{
    "mention": "OpenAI's new GPT-4 model",
    "sentiment": "positive", 
    "explanation": "Shows impressive capabilities"
  }],
  "indirect_mentions": [{
    "reference": "AI safety and job displacement",
    "sentiment": "negative",
    "explanation": "Raises concerns about negative impacts"
  }],
  "overall_sentiment": "neutral"
}
```

**Key Differences**:
- **3.3 70B**: More nuanced analysis, catches indirect negative mentions
- **3.1 8B**: Faster but misses subtle context and indirect references
- **Trade-off**: 70B provides better accuracy but 3x slower and 10x more expensive

## Setup & Installation

### Prerequisites
- Python 3.11+
- API keys for Groq and SerpAPI

### Installation
```bash
# Clone repository
git clone https://github.com/yourusername/brandscan-ai.git
cd brandscan-ai

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
```

### Environment Variables
```bash
# Required API Keys
GROQ_API_KEY=your_groq_api_key_here
SERPAPI_API_KEY=your_serpapi_key_here

# Database (optional - defaults to SQLite)
DATABASE_URL=sqlite:///./brandscan.db
```

### API Key Setup
1. **Groq API**: Visit [console.groq.com](https://console.groq.com) → Sign up → Get API key
2. **SerpAPI**: Visit [serpapi.com](https://serpapi.com) → Sign up → Get API key

### Running the Application
```bash
# Start the Streamlit app
streamlit run app.py

# Access at http://localhost:8501
```

## Deploying to Hugging Face Spaces

1. **Create a Space**: Go to [huggingface.co/new-space](https://huggingface.co/new-space).
2. **Configure**:
   - **Name**: `brandscan-ai`
   - **SDK**: Streamlit
   - **Privacy**: Public (or Private)
3. **Upload Files**: Upload all project files (except `.venv`, `.env`, and `brandscan.db`). The `.hfignore` file will handle this if you use Git.
4. **Set Secrets**: Go to **Settings** -> **Variables and secrets** -> **New secret**:
   - `GROQ_API_KEY`: Your Groq API key
   - `SERPAPI_API_KEY`: Your SerpAPI key
   - `DATABASE_URL`: `sqlite:///./brandscan.db` (Note: SQLite is not persistent on HF Spaces. For persistence, use an external PostgreSQL DB or HF Datasets).
5. **Wait for Build**: Hugging Face will automatically build and deploy your app.

## Usage

1. **Configure Search**: Enter search query and brand names
2. **Select Engines**: Choose from Google, Bing, DuckDuckGo
3. **Run Analysis**: Click "Start Batch Analysis"
4. **View Results**: Explore mentions, sentiment, and context
5. **Export Data**: Download CSV reports for further analysis

## Dependencies

### Core Libraries
- `streamlit>=1.50.0` - Web application framework
- `groq>=0.32.0` - Groq API client
- `serpapi>=0.1.5` - Google Search API
- `trafilatura>=2.0.0` - Web content extraction
- `sqlalchemy>=2.0.44` - Database ORM
- `pandas>=2.3.3` - Data manipulation
- `plotly>=6.3.1` - Interactive visualizations

### Optional Dependencies
- `psycopg2-binary` - PostgreSQL support
- `PyMySQL` - MySQL support
- `duckduckgo-search` - DuckDuckGo search
- `beautifulsoup4` - HTML parsing fallback

## Features

- 🔍 **Multi-Engine Search**: Google, Bing, DuckDuckGo
- 🤖 **AI-Powered Analysis**: Sentiment analysis with context
- 📊 **Interactive Dashboard**: Real-time analytics and visualizations
- 💾 **Data Export**: CSV reports and database storage
- ⏰ **Scheduled Monitoring**: Automated recurring analysis
- 🕸️ **Co-Mention Network**: Brand relationship visualization
- 📈 **Historical Tracking**: Trend analysis over time

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## Support

For issues and questions:
- Create an issue on GitHub
- Check the documentation
- Review the troubleshooting guide