Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

App Files Files Community

open-navigator / website /docs /integrations /eboard-automated.md

jcbowyer

Clean HuggingFace deployment without binary files

61d29fc 28 days ago

preview code

raw

history blame contribute delete

9.25 kB

	# Automated eBoard Scraping Solutions

	This guide covers fully automated solutions to bypass Incapsula protection without manual cookie extraction.

	---

	## Summary of Options

	\| Solution \| Cost \| Difficulty \| Success Rate \| Speed \|
	\|----------\|------\|------------\|--------------\|-------\|
	\| 1. Undetected ChromeDriver \| Free \| Easy \| 70-85% \| Medium \|
	\| 2. Playwright + Residential Proxies \| $10-50/month \| Medium \| 90-95% \| Fast \|
	\| 3. Browser Automation Services \| $30-100/month \| Easy \| 95-99% \| Fast \|
	\| 4. Captcha Solving Service \| $1-3/1000 solves \| Medium \| 85-90% \| Slow \|

	---

	## Option 1: Undetected ChromeDriver (Recommended for Free Solution)

	### Why It Works
	`undetected-chromedriver` patches Selenium to bypass bot detection:
	- Removes `navigator.webdriver` flag
	- Uses real Chrome binary (not ChromeDriver)
	- Randomizes browser fingerprints
	- Avoids common detection patterns

	### Installation

	```bash
	source .venv/bin/activate
	pip install undetected-chromedriver
	```

	### Usage

	```python
	# Run the new scraper
	python agents/scraper_undetected.py
	```

	Or integrate into main scraper:

	```bash
	python main.py scrape \
	--state AL \
	--municipality "Tuscaloosa City Schools" \
	--url http://simbli.eboardsolutions.com/index.aspx?s=2088 \
	--platform eboard \
	--use-undetected \
	--max-events 0
	```

	### Pros
	- ✅ Free
	- ✅ No external services required
	- ✅ Works for most Incapsula sites
	- ✅ Easy to implement

	### Cons
	- ❌ May still fail on very strict Incapsula settings
	- ❌ Requires GUI environment (can't run headless on some systems)
	- ❌ Slower than Playwright

	---

	## Option 2: Residential Proxies (Best Success Rate)

	### Why It Works
	Incapsula detects datacenter IPs. Residential proxies route through real home IPs that appear legitimate.

	### Recommended Providers

	BrightData (formerly Luminati)
	- Cost: ~$15/GB or $500/month unlimited
	- Success rate: 95%+
	- Rotating residential IPs
	- https://brightdata.com

	SmartProxy
	- Cost: $75/month for 5GB
	- Easy to use
	- Good for small projects
	- https://smartproxy.com

	Oxylabs
	- Cost: $15/GB
	- Enterprise-grade
	- https://oxylabs.io

	### Implementation

	```python
	# Install
	pip install playwright

	# Configure proxy in scraper
	async with async_playwright() as p:
	browser = await p.chromium.launch(
	proxy={
	'server': 'http://proxy.smartproxy.com:10000',
	'username': 'your_username',
	'password': 'your_password'
	}
	)
	# ... rest of scraping code
	```

	### Add to agents/scraper.py

	```python
	# In _scrape_eboard method, add:
	import os

	proxy_config = None
	if os.getenv('RESIDENTIAL_PROXY_URL'):
	proxy_config = {
	'server': os.getenv('RESIDENTIAL_PROXY_URL'),
	'username': os.getenv('PROXY_USERNAME'),
	'password': os.getenv('PROXY_PASSWORD')
	}

	browser = await p.chromium.launch(
	proxy=proxy_config,
	headless=True
	)
	```

	### .env Configuration

	```bash
	# Add to .env file
	RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000
	PROXY_USERNAME=your_username
	PROXY_PASSWORD=your_password
	```

	### Pros
	- ✅ Highest success rate (95%+)
	- ✅ Works on any Incapsula configuration
	- ✅ Can run headless
	- ✅ Fast and reliable

	### Cons
	- ❌ Costs money ($10-50/month for small projects)
	- ❌ Requires account setup
	- ❌ May have usage limits

	---

	## Option 3: Browser Automation Services (Easiest)

	### Why It Works
	These services run real browsers in the cloud and handle all anti-bot evasion automatically.

	### Recommended Services

	Browserless.io
	- Cost: $40/month for 20 hours
	- Managed Playwright/Puppeteer
	- Built-in proxy rotation
	- https://browserless.io

	```python
	from playwright.async_api import async_playwright

	async with async_playwright() as p:
	browser = await p.chromium.connect(
	'wss://chrome.browserless.io?token=YOUR_TOKEN'
	)
	page = await browser.new_page()
	await page.goto('https://simbli.eboardsolutions.com/...')
	```

	ScrapingBee
	- Cost: $49/month for 100k credits
	- Handles all anti-bot automatically
	- Simple REST API
	- https://scrapingbee.com

	```python
	import requests

	response = requests.get(
	'https://app.scrapingbee.com/api/v1/',
	params={
	'api_key': 'YOUR_API_KEY',
	'url': 'https://simbli.eboardsolutions.com/...',
	'render_js': 'true',
	'premium_proxy': 'true'
	}
	)
	content = response.text
	```

	Apify
	- Cost: $49/month
	- Pre-built scrapers for common sites
	- Can create custom scrapers
	- https://apify.com

	### Pros
	- ✅ Fully managed (no maintenance)
	- ✅ Very high success rate
	- ✅ Handles updates to anti-bot automatically
	- ✅ Can scale easily

	### Cons
	- ❌ Most expensive option
	- ❌ Requires external service dependency
	- ❌ May have rate limits

	---

	## Option 4: Captcha Solving Service

	### Why It Works
	If Incapsula shows a CAPTCHA, these services solve it automatically using AI or human workers.

	### Recommended Services

	2Captcha
	- Cost: $2.99 per 1000 CAPTCHAs
	- Supports reCAPTCHA, hCaptcha, Incapsula
	- https://2captcha.com

	Anti-Captcha
	- Cost: $2 per 1000 CAPTCHAs
	- Fast (10-30 seconds)
	- https://anti-captcha.com

	### Implementation

	```bash
	pip install 2captcha-python
	```

	```python
	from twocaptcha import TwoCaptcha
	import os

	solver = TwoCaptcha(os.getenv('2CAPTCHA_API_KEY'))

	# When Incapsula shows CAPTCHA
	try:
	result = solver.recaptcha(
	sitekey='SITE_KEY_FROM_PAGE',
	url='https://simbli.eboardsolutions.com/...'
	)

	# Inject solution into page
	await page.evaluate(f'document.getElementById("g-recaptcha-response").innerHTML="{result["code"]}";')
	await page.click('button[type="submit"]')

	except Exception as e:
	logger.error(f"CAPTCHA solving failed: {e}")
	```

	### Pros
	- ✅ Solves CAPTCHAs automatically
	- ✅ Relatively cheap
	- ✅ Works with existing scraper

	### Cons
	- ❌ Only useful if CAPTCHA appears
	- ❌ Slower (10-30 seconds per solve)
	- ❌ Not 100% success rate
	- ❌ Costs money per use

	---

	## Option 5: Reverse Engineer the API

	### Why It Works
	eBoard likely has backend APIs that mobile apps or internal tools use. These APIs may have weaker protection.

	### How to Find APIs

	1. Use browser DevTools:
	```bash
	# Open eBoard site in Chrome
	# Press F12 → Network tab
	# Look for XHR/Fetch requests
	# Check requests to:
	# - /api/
	# - .ashx files
	# - .asmx files (SOAP endpoints)
	```

	2. Check for mobile app:
	- Search App Store / Google Play for "eBoard Solutions"
	- Decompile APK to find API endpoints
	- Use mitmproxy to intercept app traffic

	3. Look for GraphQL/REST endpoints:
	```bash
	curl -I https://simbli.eboardsolutions.com/api/meetings
	curl -I https://simbli.eboardsolutions.com/graphql
	```

	### Example (if API exists)

	```python
	import httpx

	# Hypothetical API endpoint
	async with httpx.AsyncClient() as client:
	response = await client.get(
	'https://simbli.eboardsolutions.com/api/v1/meetings',
	params={'school_id': 2088},
	headers={'User-Agent': 'eBoard-Mobile/1.0'}
	)
	meetings = response.json()
	```

	### Pros
	- ✅ Fastest option
	- ✅ No bot detection
	- ✅ Free
	- ✅ Most reliable

	### Cons
	- ❌ Requires reverse engineering skills
	- ❌ API may not exist
	- ❌ API may require authentication
	- ❌ May violate Terms of Service

	---

	## Recommended Approach

	### For Personal/Research Projects (Free)
	Start with Option 1 (Undetected ChromeDriver)

	```bash
	# Install
	pip install undetected-chromedriver

	# Run test
	python agents/scraper_undetected.py
	```

	If that fails, use manual cookies (current approach) as fallback.

	### For Production/Reliable Scraping ($)
	Use Option 2 (Residential Proxies)

	Budget: ~$15-75/month depending on volume

	Best provider for this use case: SmartProxy ($75/month for 5GB)

	```bash
	# Sign up at smartproxy.com
	# Add credentials to .env
	# Enable proxy in scraper

	RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000
	PROXY_USERNAME=your_username
	PROXY_PASSWORD=your_password
	```

	### For Large Scale / Enterprise
	Use Option 3 (Browserless.io or ScrapingBee)

	Budget: $40-100/month

	Most reliable, fully managed solution.

	---

	## Implementation Plan

	### Phase 1: Try Free Options
	1. ✅ Install undetected-chromedriver
	2. ✅ Test on Tuscaloosa City Schools
	3. ✅ Measure success rate over 10 runs
	4. If success rate > 80%, use this going forward

	### Phase 2: Add Proxy Support (If Phase 1 Fails)
	1. Add proxy configuration to existing Playwright scraper
	2. Sign up for SmartProxy trial
	3. Test with residential proxy
	4. If successful, add to production

	### Phase 3: Optimize
	1. Add retry logic with exponential backoff
	2. Rotate between different methods
	3. Cache successful cookies for reuse
	4. Monitor success rate and adjust

	---

	## Next Steps

	Would you like me to:

	1. Integrate undetected-chromedriver into the main scraper (1-click solution)
	2. Add residential proxy support to existing code (requires proxy account)
	3. Try to reverse engineer the eBoard API (advanced, may take time)
	4. Create a hybrid approach that tries multiple methods automatically

	Let me know which direction you'd prefer!