--- title: BrandScanAI emoji: πŸ” colorFrom: blue colorTo: indigo sdk: streamlit sdk_version: 1.31.0 app_file: app.py pinned: false --- # BrandScanAI: Open-Source Brand Monitoring with LLM Analysis ## Overview BrandScanAI is a comprehensive brand monitoring system that combines web search, content extraction, and AI-powered sentiment analysis to track brand mentions across the internet. Built with Streamlit and powered by open-source LLMs, it provides real-time insights into brand perception and media coverage. ## Open-Source LLM APIs Explored ### Primary Implementation: Groq + Llama Models - **Model**: Llama 3.1 8B Instant (via Groq API) - **Tradeoffs**: - **Speed**: ⚑ Extremely fast inference (sub-second response times) - **Accuracy**: 🎯 Good for sentiment analysis and structured extraction - **Documentation**: πŸ“š Excellent Groq documentation with clear examples - **Cost**: πŸ’° Very affordable ($0.27/1M tokens for Llama 3.1 8B) - **Limitations**: Smaller context window compared to larger models ### Alternative Models Considered - **Llama 3.3 70B**: Higher accuracy but slower inference and higher cost - **Code Llama**: Specialized for code analysis but less suitable for general text - **Mistral 7B**: Good balance but Groq's Llama 3.1 8B proved more reliable ## Technical Challenges & Solutions ### 1. Web Crawling Challenges - **Anti-bot measures**: Implemented respectful delays (0.5s) and proper User-Agent headers - **Content extraction**: Used Trafilatura for robust article extraction vs. basic BeautifulSoup - **Rate limiting**: Graceful error handling with informative user feedback - **Dynamic content**: Limited JavaScript-heavy sites, focused on static content ### 2. LLM Querying Issues - **JSON parsing errors**: Enforced `response_format={"type": "json_object"}` in API calls - **Inconsistent outputs**: Implemented structured prompts with explicit JSON schema - **Context length**: Limited article content to 1000 characters for analysis - **API reliability**: Added retry logic and fallback error responses ### 3. Context Extraction Problems - **Noise removal**: Trafilatura effectively strips ads, navigation, and boilerplate - **Metadata extraction**: Combined Trafilatura metadata with BeautifulSoup fallback - **Content quality**: Implemented content length validation before analysis ## Scalability & Robustness Improvements ### Production-Ready Enhancements 1. **Database Integration**: SQLite/PostgreSQL for persistent storage and historical analysis 2. **Queue System**: Celery/Redis for background processing of large batches 3. **Caching Layer**: Redis for API response caching and rate limit management 4. **Monitoring**: Prometheus/Grafana for system health and performance tracking 5. **Load Balancing**: Multiple worker processes for concurrent analysis 6. **Error Recovery**: Retry mechanisms with exponential backoff 7. **API Rate Limiting**: Intelligent request throttling across multiple providers ### Architecture Improvements - **Microservices**: Separate services for search, scraping, and analysis - **Message Queues**: Asynchronous processing for large-scale monitoring - **CDN Integration**: Cached content delivery for faster responses - **Multi-region Deployment**: Geographic distribution for global brand monitoring ## LLM Comparison: Llama 3.1 8B vs Llama 3.3 70B ### Test Case: Brand Sentiment Analysis **Input**: "OpenAI's new GPT-4 model shows impressive capabilities but raises concerns about AI safety and job displacement." ### Llama 3.1 8B Response: ```json { "explicit_mentions": [{ "mention": "OpenAI's new GPT-4 model", "sentiment": "positive", "explanation": "Shows impressive capabilities" }], "indirect_mentions": [], "overall_sentiment": "neutral" } ``` ### Llama 3.3 70B Response: ```json { "explicit_mentions": [{ "mention": "OpenAI's new GPT-4 model", "sentiment": "positive", "explanation": "Shows impressive capabilities" }], "indirect_mentions": [{ "reference": "AI safety and job displacement", "sentiment": "negative", "explanation": "Raises concerns about negative impacts" }], "overall_sentiment": "neutral" } ``` **Key Differences**: - **3.3 70B**: More nuanced analysis, catches indirect negative mentions - **3.1 8B**: Faster but misses subtle context and indirect references - **Trade-off**: 70B provides better accuracy but 3x slower and 10x more expensive ## Setup & Installation ### Prerequisites - Python 3.11+ - API keys for Groq and SerpAPI ### Installation ```bash # Clone repository git clone https://github.com/yourusername/brandscan-ai.git cd brandscan-ai # Install dependencies pip install -r requirements.txt # Set up environment variables cp .env.example .env # Edit .env with your API keys ``` ### Environment Variables ```bash # Required API Keys GROQ_API_KEY=your_groq_api_key_here SERPAPI_API_KEY=your_serpapi_key_here # Database (optional - defaults to SQLite) DATABASE_URL=sqlite:///./brandscan.db ``` ### API Key Setup 1. **Groq API**: Visit [console.groq.com](https://console.groq.com) β†’ Sign up β†’ Get API key 2. **SerpAPI**: Visit [serpapi.com](https://serpapi.com) β†’ Sign up β†’ Get API key ### Running the Application ```bash # Start the Streamlit app streamlit run app.py # Access at http://localhost:8501 ``` ## Deploying to Hugging Face Spaces 1. **Create a Space**: Go to [huggingface.co/new-space](https://huggingface.co/new-space). 2. **Configure**: - **Name**: `brandscan-ai` - **SDK**: Streamlit - **Privacy**: Public (or Private) 3. **Upload Files**: Upload all project files (except `.venv`, `.env`, and `brandscan.db`). The `.hfignore` file will handle this if you use Git. 4. **Set Secrets**: Go to **Settings** -> **Variables and secrets** -> **New secret**: - `GROQ_API_KEY`: Your Groq API key - `SERPAPI_API_KEY`: Your SerpAPI key - `DATABASE_URL`: `sqlite:///./brandscan.db` (Note: SQLite is not persistent on HF Spaces. For persistence, use an external PostgreSQL DB or HF Datasets). 5. **Wait for Build**: Hugging Face will automatically build and deploy your app. ## Usage 1. **Configure Search**: Enter search query and brand names 2. **Select Engines**: Choose from Google, Bing, DuckDuckGo 3. **Run Analysis**: Click "Start Batch Analysis" 4. **View Results**: Explore mentions, sentiment, and context 5. **Export Data**: Download CSV reports for further analysis ## Dependencies ### Core Libraries - `streamlit>=1.50.0` - Web application framework - `groq>=0.32.0` - Groq API client - `serpapi>=0.1.5` - Google Search API - `trafilatura>=2.0.0` - Web content extraction - `sqlalchemy>=2.0.44` - Database ORM - `pandas>=2.3.3` - Data manipulation - `plotly>=6.3.1` - Interactive visualizations ### Optional Dependencies - `psycopg2-binary` - PostgreSQL support - `PyMySQL` - MySQL support - `duckduckgo-search` - DuckDuckGo search - `beautifulsoup4` - HTML parsing fallback ## Features - πŸ” **Multi-Engine Search**: Google, Bing, DuckDuckGo - πŸ€– **AI-Powered Analysis**: Sentiment analysis with context - πŸ“Š **Interactive Dashboard**: Real-time analytics and visualizations - πŸ’Ύ **Data Export**: CSV reports and database storage - ⏰ **Scheduled Monitoring**: Automated recurring analysis - πŸ•ΈοΈ **Co-Mention Network**: Brand relationship visualization - πŸ“ˆ **Historical Tracking**: Trend analysis over time ## License MIT License - see LICENSE file for details. ## Contributing 1. Fork the repository 2. Create a feature branch 3. Make your changes 4. Add tests if applicable 5. Submit a pull request ## Support For issues and questions: - Create an issue on GitHub - Check the documentation - Review the troubleshooting guide