Automated eBoard Scraping Solutions
This guide covers fully automated solutions to bypass Incapsula protection without manual cookie extraction.
Summary of Options
| Solution | Cost | Difficulty | Success Rate | Speed |
|---|---|---|---|---|
| 1. Undetected ChromeDriver | Free | Easy | 70-85% | Medium |
| 2. Playwright + Residential Proxies | $10-50/month | Medium | 90-95% | Fast |
| 3. Browser Automation Services | $30-100/month | Easy | 95-99% | Fast |
| 4. Captcha Solving Service | $1-3/1000 solves | Medium | 85-90% | Slow |
| 5. Reverse Engineer the API | Free | Hard | Unknown (API may not exist) | Fastest |
Option 1: Undetected ChromeDriver (Recommended Free Option)
Why It Works
undetected-chromedriver patches Selenium's ChromeDriver so that common bot-detection checks do not flag it:
- Removes the navigator.webdriver flag
- Drives a regular Chrome installation through a patched ChromeDriver binary
- Randomizes browser fingerprints
- Avoids common detection patterns
Installation
source .venv/bin/activate
pip install undetected-chromedriver
Usage
# Run the new scraper
python agents/scraper_undetected.py
Or integrate into main scraper:
python main.py scrape \
  --state AL \
  --municipality "Tuscaloosa City Schools" \
  --url "http://simbli.eboardsolutions.com/index.aspx?s=2088" \
  --platform eboard \
  --use-undetected \
  --max-events 0
Pros
- ✅ Free
- ✅ No external services required
- ✅ Works for most Incapsula sites
- ✅ Easy to implement
Cons
- ❌ May still fail on very strict Incapsula settings
- ❌ Requires GUI environment (can't run headless on some systems)
- ❌ Slower than Playwright
Option 2: Residential Proxies (Best Success Rate)
Why It Works
Incapsula detects datacenter IPs. Residential proxies route through real home IPs that appear legitimate.
Recommended Providers
BrightData (formerly Luminati)
- Cost: ~$15/GB or $500/month unlimited
- Success rate: 95%+
- Rotating residential IPs
- https://brightdata.com
SmartProxy
- Cost: $75/month for 5GB
- Easy to use
- Good for small projects
- https://smartproxy.com
Oxylabs
- Cost: $15/GB
- Enterprise-grade
- https://oxylabs.io
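Because most of these plans bill per gigabyte, a rough bandwidth estimate helps when comparing them. The sketch below is only an illustration: the function name, the 2 MB average page size, and the 2,000 page loads per month are assumptions, not measurements from eBoard.

```python
def estimate_monthly_proxy_cost(pages_per_month: int,
                                avg_page_mb: float,
                                price_per_gb: float) -> float:
    """Return the estimated monthly cost in dollars for per-GB proxy pricing."""
    total_gb = pages_per_month * avg_page_mb / 1024  # 1 GB = 1024 MB
    return round(total_gb * price_per_gb, 2)


# Example: 2,000 page loads a month at an assumed 2 MB each, billed at $15/GB.
print(estimate_monthly_proxy_cost(2000, 2.0, 15.0))
```

Measuring the real transfer size of one full scrape and plugging it in gives a far better number than any assumed page size.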
Implementation
# Install
pip install playwright
playwright install chromium

# Configure proxy in scraper (runs inside an async function)
from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch(
        proxy={
            'server': 'http://proxy.smartproxy.com:10000',
            'username': 'your_username',
            'password': 'your_password',
        }
    )
    # ... rest of scraping code
Add to agents/scraper.py
# In _scrape_eboard method, add:
import os

proxy_config = None
if os.getenv('RESIDENTIAL_PROXY_URL'):
    proxy_config = {
        'server': os.getenv('RESIDENTIAL_PROXY_URL'),
        'username': os.getenv('PROXY_USERNAME'),
        'password': os.getenv('PROXY_PASSWORD'),
    }

browser = await p.chromium.launch(
    proxy=proxy_config,
    headless=True
)
.env Configuration
# Add to .env file
RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
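A small helper can validate these variables before the browser starts, so that a half-filled .env fails loudly instead of producing confusing proxy errors. This is a minimal sketch: the name load_proxy_config is hypothetical, and it assumes the variables have already been loaded into the process environment (for example by a dotenv loader).

```python
import os
from typing import Mapping, Optional


def load_proxy_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Build the dict passed as launch(proxy=...), or None if no proxy is set."""
    server = env.get('RESIDENTIAL_PROXY_URL')
    if not server:
        return None  # no proxy configured: launch the browser directly

    username = env.get('PROXY_USERNAME')
    password = env.get('PROXY_PASSWORD')
    # Credentials must be given as a pair; one without the other is a typo.
    if bool(username) != bool(password):
        raise ValueError('Set both PROXY_USERNAME and PROXY_PASSWORD, or neither')

    config = {'server': server}
    if username:
        config.update(username=username, password=password)
    return config
```

The result can be passed straight through, as in `p.chromium.launch(proxy=load_proxy_config(), headless=True)`.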
Pros
- ✅ High success rate (90-95%+)
- ✅ Works on most Incapsula configurations
- ✅ Can run headless
- ✅ Fast and reliable
Cons
- ❌ Costs money ($10-50/month for small projects)
- ❌ Requires account setup
- ❌ May have usage limits
Option 3: Browser Automation Services (Easiest)
Why It Works
These services run real browsers in the cloud and handle all anti-bot evasion automatically.
Recommended Services
Browserless.io
- Cost: $40/month for 20 hours
- Managed Playwright/Puppeteer
- Built-in proxy rotation
- https://browserless.io
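from playwright.async_api import async_playwright

async with async_playwright() as p:
    # This URL is a Chrome DevTools Protocol endpoint, so connect_over_cdp()
    # is used; connect() expects a Playwright server endpoint instead
    browser = await p.chromium.connect_over_cdp(
        'wss://chrome.browserless.io?token=YOUR_TOKEN'
    )
    page = await browser.new_page()
    await page.goto('https://simbli.eboardsolutions.com/...')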
ScrapingBee
- Cost: $49/month for 100k credits
- Handles all anti-bot automatically
- Simple REST API
- https://scrapingbee.com
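import requests

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'https://simbli.eboardsolutions.com/...',
        'render_js': 'true',
        'premium_proxy': 'true',
    },
    timeout=120,  # rendered requests through premium proxies can be slow
)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
content = response.text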
Apify
- Cost: $49/month
- Pre-built scrapers for common sites
- Can create custom scrapers
- https://apify.com
Pros
- ✅ Fully managed (no maintenance)
- ✅ Very high success rate
- ✅ Handles updates to anti-bot automatically
- ✅ Can scale easily
Cons
- ❌ Most expensive option
- ❌ Requires external service dependency
- ❌ May have rate limits
Option 4: Captcha Solving Service
Why It Works
If Incapsula shows a CAPTCHA, these services solve it automatically using AI or human workers.
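Before paying for a solve, the scraper needs to know whether the response is a CAPTCHA, a plain block page, or real content. The sketch below is a heuristic only. The marker strings are ones commonly seen on Incapsula-protected sites and CAPTCHA widgets, and the 5,000-character threshold is an arbitrary assumption, so check both against actual eBoard responses.

```python
# Markers commonly seen on Incapsula block pages (assumed to apply to eBoard).
BLOCK_TEXT = 'Incapsula incident ID'
CHALLENGE_SCRIPT = '_Incapsula_Resource'
# Class names used by the reCAPTCHA and hCaptcha widgets.
CAPTCHA_MARKERS = ('g-recaptcha', 'h-captcha')
# Assumed cutoff: challenge pages are short, real pages are long.
SHORT_PAGE_CHARS = 5000


def classify_page(html: str) -> str:
    """Return 'captcha', 'blocked', or 'ok' for a fetched page body."""
    if any(marker in html for marker in CAPTCHA_MARKERS):
        return 'captcha'
    if BLOCK_TEXT in html:
        return 'blocked'
    # Protected sites may embed the challenge script on real pages too,
    # so only treat it as a block when the page is suspiciously short.
    if CHALLENGE_SCRIPT in html and len(html) < SHORT_PAGE_CHARS:
        return 'blocked'
    return 'ok'
```

Only a 'captcha' result is worth sending to a solving service; a 'blocked' result calls for one of the other options.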
Recommended Services
2Captcha
- Cost: $2.99 per 1000 CAPTCHAs
- Supports reCAPTCHA, hCaptcha, Incapsula
- https://2captcha.com
Anti-Captcha
- Cost: $2 per 1000 CAPTCHAs
- Fast (10-30 seconds)
- https://anti-captcha.com
Implementation
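pip install 2captcha-python

import asyncio
import logging
import os

from twocaptcha import TwoCaptcha

logger = logging.getLogger(__name__)

# Shell variable names cannot start with a digit, so the key is read from
# TWOCAPTCHA_API_KEY rather than 2CAPTCHA_API_KEY
solver = TwoCaptcha(os.getenv('TWOCAPTCHA_API_KEY'))

# When Incapsula shows CAPTCHA (inside the async scraper, where `page` is the Playwright page)
try:
    # solver.recaptcha() blocks while it polls, so run it off the event loop
    result = await asyncio.to_thread(
        solver.recaptcha,
        sitekey='SITE_KEY_FROM_PAGE',
        url='https://simbli.eboardsolutions.com/...',
    )
    # Inject solution into page; the token is passed as an argument
    # instead of being interpolated into the script
    await page.evaluate(
        '(token) => { document.getElementById("g-recaptcha-response").innerHTML = token; }',
        result['code'],
    )
    await page.click('button[type="submit"]')
except Exception as e:
    logger.error(f"CAPTCHA solving failed: {e}")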
Pros
- ✅ Solves CAPTCHAs automatically
- ✅ Relatively cheap
- ✅ Works with existing scraper
Cons
- ❌ Only useful if CAPTCHA appears
- ❌ Slower (10-30 seconds per solve)
- ❌ Not 100% success rate
- ❌ Costs money per use
Option 5: Reverse Engineer the API
Why It Works
eBoard likely has backend APIs that mobile apps or internal tools use. These APIs may have weaker protection.
How to Find APIs
Use browser DevTools:
# Open eBoard site in Chrome
# Press F12 → Network tab
# Look for XHR/Fetch requests
# Check requests to:
# - /api/
# - .ashx files
# - .asmx files (SOAP endpoints)
Check for mobile app:
- Search App Store / Google Play for "eBoard Solutions"
- Decompile APK to find API endpoints
- Use mitmproxy to intercept app traffic
Look for GraphQL/REST endpoints:
curl -I https://simbli.eboardsolutions.com/api/meetings
curl -I https://simbli.eboardsolutions.com/graphql
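The DevTools step can be partly automated by exporting the Network tab as a HAR file and filtering it for the patterns listed above. This is a rough sketch: candidate_endpoints is a hypothetical helper, and the _resourceType field is a Chrome-specific addition to HAR exports, so other browsers may not include it.

```python
from urllib.parse import urlparse

# Path suffixes called out above as likely backend handlers.
INTERESTING_SUFFIXES = ('.ashx', '.asmx')


def candidate_endpoints(har: dict) -> list:
    """Return unique request URLs from a HAR export that look like API calls."""
    found = []
    for entry in har.get('log', {}).get('entries', []):
        url = entry.get('request', {}).get('url', '')
        path = urlparse(url).path.lower()
        # Chrome tags each HAR entry with the resource type it loaded.
        is_xhr = entry.get('_resourceType') in ('xhr', 'fetch')
        looks_like_api = '/api/' in path or path.endswith(INTERESTING_SUFFIXES)
        if (is_xhr or looks_like_api) and url not in found:
            found.append(url)
    return found
```

Load the export with `json.load()` and pass the resulting dict in; the returned URLs are only candidates and still need to be inspected by hand.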
Example (if API exists)
import httpx

# Hypothetical API endpoint (runs inside an async function)
async with httpx.AsyncClient() as client:
    response = await client.get(
        'https://simbli.eboardsolutions.com/api/v1/meetings',
        params={'school_id': 2088},
        headers={'User-Agent': 'eBoard-Mobile/1.0'},
    )
    response.raise_for_status()  # stop here if the endpoint does not exist
    meetings = response.json()
Pros
- ✅ Fastest option
- ✅ No bot detection
- ✅ Free
- ✅ Most reliable
Cons
- ❌ Requires reverse engineering skills
- ❌ API may not exist
- ❌ API may require authentication
- ❌ May violate Terms of Service
Recommended Approach
For Personal/Research Projects (Free)
Start with Option 1 (Undetected ChromeDriver)
# Install
pip install undetected-chromedriver
# Run test
python agents/scraper_undetected.py
If that fails, use manual cookies (current approach) as fallback.
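The cookie fallback is less tedious when a working set of cookies is saved with a timestamp and reused until it is likely stale. The sketch below assumes cookies are a list of dicts, which is the shape Playwright's context.cookies() returns and context.add_cookies() accepts. The one-hour lifetime is a guess, not a documented Incapsula value.

```python
import json
import time
from pathlib import Path
from typing import Optional


def save_cookies(path: Path, cookies: list, now: Optional[float] = None) -> None:
    """Write cookies to disk together with the time they were saved."""
    saved_at = time.time() if now is None else now
    path.write_text(json.dumps({'saved_at': saved_at, 'cookies': cookies}))


def load_cookies(path: Path, max_age_s: float = 3600,
                 now: Optional[float] = None) -> Optional[list]:
    """Return cached cookies, or None if the file is missing or too old."""
    if not path.exists():
        return None
    payload = json.loads(path.read_text())
    current = time.time() if now is None else now
    if current - payload['saved_at'] > max_age_s:
        return None  # stale: fetch fresh cookies instead
    return payload['cookies']
```

A None result signals that cookies need to be extracted again, whether manually or by one of the automated options.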
For Production/Reliable Scraping ($)
Use Option 2 (Residential Proxies)
Budget: ~$15-75/month depending on volume
Best provider for this use case: SmartProxy ($75/month for 5GB)
# Sign up at smartproxy.com
# Add credentials to .env
# Enable proxy in scraper
RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
For Large Scale / Enterprise
Use Option 3 (Browserless.io or ScrapingBee)
Budget: $40-100/month
Most reliable, fully managed solution.
Implementation Plan
Phase 1: Try Free Options
- ✅ Install undetected-chromedriver
- ✅ Test on Tuscaloosa City Schools
- ✅ Measure success rate over 10 runs
- If success rate > 80%, use this going forward
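The measurement step could look roughly like this. It is a sketch in which attempt stands for any zero-argument callable that runs one scrape and returns True on success; the names are hypothetical, and the 10 runs and 80% threshold come from the list above.

```python
from typing import Callable


def measure_success_rate(attempt: Callable[[], bool], runs: int = 10) -> float:
    """Run the scrape `runs` times and return the fraction that succeeded."""
    successes = 0
    for _ in range(runs):
        try:
            if attempt():
                successes += 1
        except Exception:
            pass  # a crashed run counts as a failure
    return successes / runs


def should_adopt(rate: float, threshold: float = 0.8) -> bool:
    """Phase 1 rule: keep the method only if it beats the threshold."""
    return rate > threshold
```

Ten runs give a coarse estimate at best, so a result close to the threshold is worth re-measuring with more runs spread over several hours.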
Phase 2: Add Proxy Support (If Phase 1 Fails)
- Add proxy configuration to existing Playwright scraper
- Sign up for SmartProxy trial
- Test with residential proxy
- If successful, add to production
Phase 3: Optimize
- Add retry logic with exponential backoff
- Rotate between different methods
- Cache successful cookies for reuse
- Monitor success rate and adjust
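Retrying with backoff and rotating between methods can be combined in one loop. The following is a minimal sketch, assuming each method is a zero-argument callable that returns page HTML or None on failure; the function names, the retry count, and the delay values are illustrative choices.

```python
import time
from typing import Callable, Optional, Sequence


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Delays that double each time (1s, 2s, 4s, ...) up to `cap` seconds."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def scrape_with_fallback(methods: Sequence[Callable[[], Optional[str]]],
                         retries: int = 3,
                         sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Try each method in order, retrying with backoff before moving on."""
    for method in methods:
        for attempt, delay in enumerate(backoff_delays(retries)):
            try:
                result = method()
            except Exception:
                result = None  # treat an exception like a failed attempt
            if result is not None:
                return result
            if attempt < retries - 1:
                sleep(delay)  # wait before retrying the same method
    return None  # every method exhausted its retries
```

Ordering the methods from cheapest to most expensive (for example undetected-chromedriver first, a proxy-backed scraper second) keeps paid services as a last resort.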
Next Steps
Possible follow-up work:
- Integrate undetected-chromedriver into the main scraper (1-click solution)
- Add residential proxy support to existing code (requires proxy account)
- Try to reverse engineer the eBoard API (advanced, may take time)
- Create a hybrid approach that tries multiple methods automatically