| # Automated eBoard Scraping Solutions | |
| This guide covers **fully automated** solutions to bypass Incapsula protection without manual cookie extraction. | |
| --- | |
| ## Summary of Options | |
| | Solution | Cost | Difficulty | Success Rate | Speed | | |
| |----------|------|------------|--------------|-------| | |
| | **1. Undetected ChromeDriver** | Free | Easy | 70-85% | Medium | | |
| | **2. Playwright + Residential Proxies** | $10-50/month | Medium | 90-95% | Fast | | |
| | **3. Browser Automation Services** | $30-100/month | Easy | 95-99% | Fast | | |
| | **4. Captcha Solving Service** | $1-3/1000 solves | Medium | 85-90% | Slow | | |
| | **5. Reverse Engineer the API** | Free | Hard | Unknown (API may not exist) | Fast | | |
| --- | |
| ## Option 1: Undetected ChromeDriver (Recommended Free Solution) | |
| ### Why It Works | |
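| `undetected-chromedriver` is a drop-in replacement for Selenium's Chrome driver that patches the ChromeDriver binary to avoid common bot-detection checks: | |
| - Hides the `navigator.webdriver` flag | |
| - Drives a regular Chrome installation through the patched ChromeDriver | |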
| - Randomizes browser fingerprints | |
| - Avoids common detection patterns | |
| ### Installation | |
| ```bash | |
| source .venv/bin/activate | |
| pip install undetected-chromedriver | |
| ``` | |
| ### Usage | |
| ```bash | |
| # Run the new scraper | |
| python agents/scraper_undetected.py | |
| ``` | |
| Or integrate into main scraper: | |
| ```bash | |
| python main.py scrape \ | |
| --state AL \ | |
| --municipality "Tuscaloosa City Schools" \ | |
| --url "http://simbli.eboardsolutions.com/index.aspx?s=2088" \ | |
| --platform eboard \ | |
| --use-undetected \ | |
| --max-events 0 | |
| ``` | |
| ### Pros | |
| - ✅ Free | |
| - ✅ No external services required | |
| - ✅ Works for most Incapsula sites | |
| - ✅ Easy to implement | |
| ### Cons | |
| - ❌ May still fail on very strict Incapsula settings | |
| - ❌ Requires GUI environment (can't run headless on some systems) | |
| - ❌ Slower than Playwright | |
| --- | |
| ## Option 2: Residential Proxies (Best Success Rate) | |
| ### Why It Works | |
| Incapsula detects datacenter IPs. Residential proxies route through real home IPs that appear legitimate. | |
| ### Recommended Providers | |
| **BrightData (formerly Luminati)** | |
| - Cost: ~$15/GB or $500/month unlimited | |
| - Success rate: 95%+ | |
| - Rotating residential IPs | |
| - https://brightdata.com | |
| **SmartProxy** | |
| - Cost: $75/month for 5GB | |
| - Easy to use | |
| - Good for small projects | |
| - https://smartproxy.com | |
| **Oxylabs** | |
| - Cost: $15/GB | |
| - Enterprise-grade | |
| - https://oxylabs.io | |
| ### Implementation | |
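| ```python | |
| # Install first: pip install playwright && playwright install chromium | |
| import asyncio | |
| from playwright.async_api import async_playwright | |
| async def main(): | |
|     # Configure proxy in scraper | |
|     async with async_playwright() as p: | |
|         browser = await p.chromium.launch( | |
|             proxy={ | |
|                 'server': 'http://proxy.smartproxy.com:10000', | |
|                 'username': 'your_username', | |
|                 'password': 'your_password' | |
|             } | |
|         ) | |
|         # ... rest of scraping code | |
|         await browser.close() | |
| asyncio.run(main()) | |
| ``` | |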
| ### Add to agents/scraper.py | |
| ```python | |
| # In _scrape_eboard method, add: | |
| import os | |
| proxy_config = None | |
| if os.getenv('RESIDENTIAL_PROXY_URL'): | |
|     proxy_config = { | |
|         'server': os.getenv('RESIDENTIAL_PROXY_URL'), | |
|         'username': os.getenv('PROXY_USERNAME'), | |
|         'password': os.getenv('PROXY_PASSWORD') | |
|     } | |
| browser = await p.chromium.launch( | |
|     proxy=proxy_config, | |
|     headless=True | |
| ) | |
| ``` | |
| ### .env Configuration | |
| ```bash | |
| # Add to .env file | |
| RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000 | |
| PROXY_USERNAME=your_username | |
| PROXY_PASSWORD=your_password | |
| ``` | |
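The proxy settings above can be gathered in one small helper so the scraper falls back to a direct connection when nothing is configured. This is a minimal sketch, not the project's actual code: `build_proxy_config` is a hypothetical name, and it assumes the `.env` values have already been loaded into the process environment (for example by a dotenv loader).

```python
import os
from typing import Dict, Mapping, Optional


def build_proxy_config(env: Mapping[str, str] = os.environ) -> Optional[Dict[str, str]]:
    """Return a Playwright-style proxy dict, or None when no proxy is configured."""
    server = env.get("RESIDENTIAL_PROXY_URL")
    if not server:
        return None
    config = {"server": server}
    # Credentials are optional: include them only when both are present.
    username = env.get("PROXY_USERNAME")
    password = env.get("PROXY_PASSWORD")
    if username and password:
        config["username"] = username
        config["password"] = password
    return config
```

The return value can be passed directly as the `proxy=` argument of `p.chromium.launch(...)`; passing `None` launches without a proxy.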
| ### Pros | |
| - ✅ Highest success rate (95%+) | |
| - ✅ Works on most Incapsula configurations | |
| - ✅ Can run headless | |
| - ✅ Fast and reliable | |
| ### Cons | |
| - ❌ Costs money ($10-50/month for small projects) | |
| - ❌ Requires account setup | |
| - ❌ May have usage limits | |
| --- | |
| ## Option 3: Browser Automation Services (Easiest) | |
| ### Why It Works | |
| These services run real browsers in the cloud and handle all anti-bot evasion automatically. | |
| ### Recommended Services | |
| **Browserless.io** | |
| - Cost: $40/month for 20 hours | |
| - Managed Playwright/Puppeteer | |
| - Built-in proxy rotation | |
| - https://browserless.io | |
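| ```python | |
| import asyncio | |
| from playwright.async_api import async_playwright | |
| async def main(): | |
|     async with async_playwright() as p: | |
|         # This endpoint speaks the Chrome DevTools Protocol, so connect with connect_over_cdp | |
|         browser = await p.chromium.connect_over_cdp( | |
|             'wss://chrome.browserless.io?token=YOUR_TOKEN' | |
|         ) | |
|         page = await browser.new_page() | |
|         await page.goto('https://simbli.eboardsolutions.com/...') | |
|         await browser.close() | |
| asyncio.run(main()) | |
| ``` | |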
| **ScrapingBee** | |
| - Cost: $49/month for 100k credits | |
| - Handles all anti-bot automatically | |
| - Simple REST API | |
| - https://scrapingbee.com | |
| ```python | |
| import requests | |
| response = requests.get( | |
|     'https://app.scrapingbee.com/api/v1/', | |
|     params={ | |
|         'api_key': 'YOUR_API_KEY', | |
|         'url': 'https://simbli.eboardsolutions.com/...', | |
|         'render_js': 'true', | |
|         'premium_proxy': 'true' | |
|     }, | |
|     timeout=60  # rendered pages can take a while; avoid hanging forever | |
| ) | |
| response.raise_for_status()  # surface HTTP errors instead of parsing an error page | |
| content = response.text | |
| ``` | |
| **Apify** | |
| - Cost: $49/month | |
| - Pre-built scrapers for common sites | |
| - Can create custom scrapers | |
| - https://apify.com | |
| ### Pros | |
| - ✅ Fully managed (no maintenance) | |
| - ✅ Very high success rate | |
| - ✅ Handles updates to anti-bot automatically | |
| - ✅ Can scale easily | |
| ### Cons | |
| - ❌ Most expensive option | |
| - ❌ Requires external service dependency | |
| - ❌ May have rate limits | |
| --- | |
| ## Option 4: Captcha Solving Service | |
| ### Why It Works | |
| If Incapsula shows a CAPTCHA, these services solve it automatically using AI or human workers. | |
| ### Recommended Services | |
| **2Captcha** | |
| - Cost: $2.99 per 1000 CAPTCHAs | |
| - Supports reCAPTCHA, hCaptcha, Incapsula | |
| - https://2captcha.com | |
| **Anti-Captcha** | |
| - Cost: $2 per 1000 CAPTCHAs | |
| - Fast (10-30 seconds) | |
| - https://anti-captcha.com | |
| ### Implementation | |
| ```bash | |
| pip install 2captcha-python | |
| ``` | |
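| ```python | |
| import asyncio | |
| import logging | |
| import os | |
| from twocaptcha import TwoCaptcha | |
| logger = logging.getLogger(__name__) | |
| # Shell variable names cannot start with a digit, so the key is read from TWOCAPTCHA_API_KEY | |
| solver = TwoCaptcha(os.getenv('TWOCAPTCHA_API_KEY')) | |
| async def solve_captcha(page): | |
|     # When Incapsula shows CAPTCHA | |
|     try: | |
|         # solver.recaptcha blocks, so run it in a worker thread | |
|         result = await asyncio.to_thread( | |
|             solver.recaptcha, | |
|             sitekey='SITE_KEY_FROM_PAGE', | |
|             url='https://simbli.eboardsolutions.com/...' | |
|         ) | |
|         # Inject solution into page; pass the token as an argument instead of interpolating it | |
|         await page.evaluate( | |
|             '(token) => { document.getElementById("g-recaptcha-response").innerHTML = token; }', | |
|             result['code'] | |
|         ) | |
|         await page.click('button[type="submit"]') | |
|     except Exception as e: | |
|         logger.error(f"CAPTCHA solving failed: {e}") | |
| ``` | |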
| ### Pros | |
| - ✅ Solves CAPTCHAs automatically | |
| - ✅ Relatively cheap | |
| - ✅ Works with existing scraper | |
| ### Cons | |
| - ❌ Only useful if CAPTCHA appears | |
| - ❌ Slower (10-30 seconds per solve) | |
| - ❌ Not 100% success rate | |
| - ❌ Costs money per use | |
| --- | |
| ## Option 5: Reverse Engineer the API | |
| ### Why It Works | |
| eBoard likely has backend APIs that mobile apps or internal tools use. These APIs may have weaker protection. | |
| ### How to Find APIs | |
| 1. **Use browser DevTools**: | |
| ```bash | |
| # Open eBoard site in Chrome | |
| # Press F12 → Network tab | |
| # Look for XHR/Fetch requests | |
| # Check requests to: | |
| # - /api/ | |
| # - .ashx files | |
| # - .asmx files (SOAP endpoints) | |
| ``` | |
| 2. **Check for mobile app**: | |
| - Search App Store / Google Play for "eBoard Solutions" | |
| - Decompile APK to find API endpoints | |
| - Use mitmproxy to intercept app traffic | |
| 3. **Look for GraphQL/REST endpoints**: | |
| ```bash | |
| curl -I https://simbli.eboardsolutions.com/api/meetings | |
| curl -I https://simbli.eboardsolutions.com/graphql | |
| ``` | |
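The DevTools step above can be partly automated by exporting the Network tab as a HAR file and filtering it for the path patterns listed. The sketch below assumes a standard HAR layout (`log.entries[].request.url`); the function names and the marker list are illustrative, not part of the project.

```python
import json
from typing import List
from urllib.parse import urlparse

# Path fragments the checklist above suggests looking for.
CANDIDATE_MARKERS = ("/api/", ".ashx", ".asmx", "/graphql")


def candidate_endpoints(har: dict) -> List[str]:
    """Return unique request URLs whose path looks like a backend endpoint."""
    found: List[str] = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        path = urlparse(url).path.lower()
        if any(marker in path for marker in CANDIDATE_MARKERS) and url not in found:
            found.append(url)
    return found


def load_har(path: str) -> dict:
    """Read a HAR file exported from the browser's Network tab."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `candidate_endpoints(load_har("eboard.har"))` on an export (the file name is a placeholder) gives a short list of URLs worth inspecting by hand.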
| ### Example (if API exists) | |
| ```python | |
| import asyncio | |
| import httpx | |
| async def fetch_meetings(): | |
|     # Hypothetical API endpoint | |
|     async with httpx.AsyncClient() as client: | |
|         response = await client.get( | |
|             'https://simbli.eboardsolutions.com/api/v1/meetings', | |
|             params={'school_id': 2088}, | |
|             headers={'User-Agent': 'eBoard-Mobile/1.0'} | |
|         ) | |
|         response.raise_for_status() | |
|         return response.json() | |
| meetings = asyncio.run(fetch_meetings()) | |
| ``` | |
| ### Pros | |
| - ✅ Fastest option | |
| - ✅ Likely weaker bot detection | |
| - ✅ Free | |
| - ✅ Most reliable, if a usable API exists | |
| ### Cons | |
| - ❌ Requires reverse engineering skills | |
| - ❌ API may not exist | |
| - ❌ API may require authentication | |
| - ❌ May violate Terms of Service | |
| --- | |
| ## Recommended Approach | |
| ### For Personal/Research Projects (Free) | |
| **Start with Option 1 (Undetected ChromeDriver)** | |
| ```bash | |
| # Install | |
| pip install undetected-chromedriver | |
| # Run test | |
| python agents/scraper_undetected.py | |
| ``` | |
| If that fails, use **manual cookies** (current approach) as fallback. | |
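The "try one method, then fall back" pattern described here can be written as a small ordered chain. This is a sketch under assumptions: `fetch_with_fallback` and the fetcher callables are hypothetical, and each fetcher is assumed to take a URL and return the page HTML, or nothing on failure.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns HTML, or None/"" when it was blocked.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str, methods: Sequence[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, fetcher) in order; return (name, html) for the first success."""
    for name, fetch in methods:
        try:
            html = fetch(url)
        except Exception as exc:
            # One failing method should not stop the rest of the chain.
            print(f"{name} failed: {exc}")
            continue
        if html:
            return name, html
    # Every method failed or returned an empty page.
    return None, None
```

Ordering the list from cheapest to most expensive keeps paid methods as a last resort.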
| ### For Production/Reliable Scraping ($) | |
| **Use Option 2 (Residential Proxies)** | |
| Budget: ~$15-75/month depending on volume | |
| Best provider for this use case: **SmartProxy** ($75/month for 5GB) | |
| ```bash | |
| # Sign up at smartproxy.com | |
| # Add credentials to .env | |
| # Enable proxy in scraper | |
| RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000 | |
| PROXY_USERNAME=your_username | |
| PROXY_PASSWORD=your_password | |
| ``` | |
| ### For Large Scale / Enterprise | |
| **Use Option 3 (Browserless.io or ScrapingBee)** | |
| Budget: $40-100/month | |
| Most reliable, fully managed solution. | |
| --- | |
| ## Implementation Plan | |
| ### Phase 1: Try Free Options | |
| 1. ✅ Install undetected-chromedriver | |
| 2. ✅ Test on Tuscaloosa City Schools | |
| 3. ✅ Measure success rate over 10 runs | |
| 4. If success rate > 80%, use this going forward | |
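Steps 3 and 4 amount to a simple measurement and threshold check. A minimal sketch, assuming a hypothetical `attempt` callable that performs one scrape and returns `True` on success:

```python
from typing import Callable


def measure_success_rate(attempt: Callable[[], bool], runs: int = 10) -> float:
    """Call `attempt` `runs` times and return the fraction that succeeded."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    successes = 0
    for _ in range(runs):
        try:
            if attempt():
                successes += 1
        except Exception:
            # An exception counts as a failed run.
            pass
    return successes / runs


def meets_threshold(rate: float, threshold: float = 0.8) -> bool:
    """Mirror the rule above: strictly greater than 80%."""
    return rate > threshold
```

Ten runs give only a coarse estimate, so a result near the threshold is worth re-measuring before committing to the method.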
| ### Phase 2: Add Proxy Support (If Phase 1 Fails) | |
| 1. Add proxy configuration to existing Playwright scraper | |
| 2. Sign up for SmartProxy trial | |
| 3. Test with residential proxy | |
| 4. If successful, add to production | |
| ### Phase 3: Optimize | |
| 1. Add retry logic with exponential backoff | |
| 2. Rotate between different methods | |
| 3. Cache successful cookies for reuse | |
| 4. Monitor success rate and adjust | |
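Retry with exponential backoff (step 1) can be sketched as follows. The function name, defaults, and the injectable `sleep` parameter are illustrative assumptions; the delay doubles after each failure, is capped at `max_delay`, and gets a little random jitter so retries do not fire in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception with exponentially growing waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to `jitter` (as a fraction of the delay) of random extra wait.
            sleep(delay + random.uniform(0, delay * jitter))
    raise AssertionError("unreachable")
```

Passing a fake `sleep` makes the schedule easy to unit test without actually waiting.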
| --- | |
| ## Next Steps | |
| Possible directions from here: | |
| 1. **Integrate undetected-chromedriver into the main scraper** (simplest change) | |
| 2. **Add residential proxy support** to existing code (requires proxy account) | |
| 3. **Try to reverse engineer the eBoard API** (advanced, may take time) | |
| 4. **Create a hybrid approach** that tries multiple methods automatically | |
| Which one to pursue depends on budget and on how Option 1 performs in Phase 1. | |