BrandScanAI / README.md
Arun21102003
Deployment preparation (removed binary files)
90fe073

A newer version of the Streamlit SDK is available: 1.58.0

Upgrade
metadata
title: BrandScanAI
emoji: πŸ”
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.31.0
app_file: app.py
pinned: false

BrandScanAI: Open-Source Brand Monitoring with LLM Analysis

Overview

BrandScanAI is a comprehensive brand monitoring system that combines web search, content extraction, and AI-powered sentiment analysis to track brand mentions across the internet. Built with Streamlit and powered by open-source LLMs, it provides real-time insights into brand perception and media coverage.

Open-Source LLM APIs Explored

Primary Implementation: Groq + Llama Models

  • Model: Llama 3.1 8B Instant (via Groq API)
  • Tradeoffs:
    • Speed: ⚑ Extremely fast inference (sub-second response times)
    • Accuracy: 🎯 Good for sentiment analysis and structured extraction
    • Documentation: πŸ“š Excellent Groq documentation with clear examples
    • Cost: πŸ’° Very affordable ($0.27/1M tokens for Llama 3.1 8B)
    • Limitations: Smaller context window compared to larger models

Alternative Models Considered

  • Llama 3.3 70B: Higher accuracy but slower inference and higher cost
  • Code Llama: Specialized for code analysis but less suitable for general text
  • Mistral 7B: Good balance but Groq's Llama 3.1 8B proved more reliable

Technical Challenges & Solutions

1. Web Crawling Challenges

  • Anti-bot measures: Implemented respectful delays (0.5s) and proper User-Agent headers
  • Content extraction: Used Trafilatura for robust article extraction vs. basic BeautifulSoup
  • Rate limiting: Graceful error handling with informative user feedback
  • Dynamic content: Limited JavaScript-heavy sites, focused on static content

2. LLM Querying Issues

  • JSON parsing errors: Enforced response_format={"type": "json_object"} in API calls
  • Inconsistent outputs: Implemented structured prompts with explicit JSON schema
  • Context length: Limited article content to 1000 characters for analysis
  • API reliability: Added retry logic and fallback error responses

3. Context Extraction Problems

  • Noise removal: Trafilatura effectively strips ads, navigation, and boilerplate
  • Metadata extraction: Combined Trafilatura metadata with BeautifulSoup fallback
  • Content quality: Implemented content length validation before analysis

Scalability & Robustness Improvements

Production-Ready Enhancements

  1. Database Integration: SQLite/PostgreSQL for persistent storage and historical analysis
  2. Queue System: Celery/Redis for background processing of large batches
  3. Caching Layer: Redis for API response caching and rate limit management
  4. Monitoring: Prometheus/Grafana for system health and performance tracking
  5. Load Balancing: Multiple worker processes for concurrent analysis
  6. Error Recovery: Retry mechanisms with exponential backoff
  7. API Rate Limiting: Intelligent request throttling across multiple providers

Architecture Improvements

  • Microservices: Separate services for search, scraping, and analysis
  • Message Queues: Asynchronous processing for large-scale monitoring
  • CDN Integration: Cached content delivery for faster responses
  • Multi-region Deployment: Geographic distribution for global brand monitoring

LLM Comparison: Llama 3.1 8B vs Llama 3.3 70B

Test Case: Brand Sentiment Analysis

Input: "OpenAI's new GPT-4 model shows impressive capabilities but raises concerns about AI safety and job displacement."

Llama 3.1 8B Response:

{
  "explicit_mentions": [{
    "mention": "OpenAI's new GPT-4 model",
    "sentiment": "positive",
    "explanation": "Shows impressive capabilities"
  }],
  "indirect_mentions": [],
  "overall_sentiment": "neutral"
}

Llama 3.3 70B Response:

{
  "explicit_mentions": [{
    "mention": "OpenAI's new GPT-4 model",
    "sentiment": "positive", 
    "explanation": "Shows impressive capabilities"
  }],
  "indirect_mentions": [{
    "reference": "AI safety and job displacement",
    "sentiment": "negative",
    "explanation": "Raises concerns about negative impacts"
  }],
  "overall_sentiment": "neutral"
}

Key Differences:

  • 3.3 70B: More nuanced analysis, catches indirect negative mentions
  • 3.1 8B: Faster but misses subtle context and indirect references
  • Trade-off: 70B provides better accuracy but 3x slower and 10x more expensive

Setup & Installation

Prerequisites

  • Python 3.11+
  • API keys for Groq and SerpAPI

Installation

# Clone repository
git clone https://github.com/yourusername/brandscan-ai.git
cd brandscan-ai

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

Environment Variables

# Required API Keys
GROQ_API_KEY=your_groq_api_key_here
SERPAPI_API_KEY=your_serpapi_key_here

# Database (optional - defaults to SQLite)
DATABASE_URL=sqlite:///./brandscan.db

API Key Setup

  1. Groq API: Visit console.groq.com β†’ Sign up β†’ Get API key
  2. SerpAPI: Visit serpapi.com β†’ Sign up β†’ Get API key

Running the Application

# Start the Streamlit app
streamlit run app.py

# Access at http://localhost:8501

Deploying to Hugging Face Spaces

  1. Create a Space: Go to huggingface.co/new-space.
  2. Configure:
    • Name: brandscan-ai
    • SDK: Streamlit
    • Privacy: Public (or Private)
  3. Upload Files: Upload all project files (except .venv, .env, and brandscan.db). The .hfignore file will handle this if you use Git.
  4. Set Secrets: Go to Settings -> Variables and secrets -> New secret:
    • GROQ_API_KEY: Your Groq API key
    • SERPAPI_API_KEY: Your SerpAPI key
    • DATABASE_URL: sqlite:///./brandscan.db (Note: SQLite is not persistent on HF Spaces. For persistence, use an external PostgreSQL DB or HF Datasets).
  5. Wait for Build: Hugging Face will automatically build and deploy your app.

Usage

  1. Configure Search: Enter search query and brand names
  2. Select Engines: Choose from Google, Bing, DuckDuckGo
  3. Run Analysis: Click "Start Batch Analysis"
  4. View Results: Explore mentions, sentiment, and context
  5. Export Data: Download CSV reports for further analysis

Dependencies

Core Libraries

  • streamlit>=1.50.0 - Web application framework
  • groq>=0.32.0 - Groq API client
  • serpapi>=0.1.5 - Google Search API
  • trafilatura>=2.0.0 - Web content extraction
  • sqlalchemy>=2.0.44 - Database ORM
  • pandas>=2.3.3 - Data manipulation
  • plotly>=6.3.1 - Interactive visualizations

Optional Dependencies

  • psycopg2-binary - PostgreSQL support
  • PyMySQL - MySQL support
  • duckduckgo-search - DuckDuckGo search
  • beautifulsoup4 - HTML parsing fallback

Features

  • πŸ” Multi-Engine Search: Google, Bing, DuckDuckGo
  • πŸ€– AI-Powered Analysis: Sentiment analysis with context
  • πŸ“Š Interactive Dashboard: Real-time analytics and visualizations
  • πŸ’Ύ Data Export: CSV reports and database storage
  • ⏰ Scheduled Monitoring: Automated recurring analysis
  • πŸ•ΈοΈ Co-Mention Network: Brand relationship visualization
  • πŸ“ˆ Historical Tracking: Trend analysis over time

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Support

For issues and questions:

  • Create an issue on GitHub
  • Check the documentation
  • Review the troubleshooting guide