# Automated eBoard Scraping Solutions

This guide covers **fully automated** solutions to bypass Incapsula protection without manual cookie extraction.

---

## Summary of Options

| Solution | Cost | Difficulty | Success Rate | Speed |
|----------|------|------------|--------------|-------|
| **1. Undetected ChromeDriver** | Free | Easy | 70-85% | Medium |
| **2. Playwright + Residential Proxies** | $10-50/month | Medium | 90-95% | Fast |
| **3. Browser Automation Services** | $30-100/month | Easy | 95-99% | Fast |
| **4. Captcha Solving Service** | $1-3/1000 solves | Medium | 85-90% | Slow |
| **5. Reverse Engineer the API** | Free | Hard | Unknown (API may not exist) | Fastest |

---

## Option 1: Undetected ChromeDriver (Recommended Free Solution)

### Why It Works
`undetected-chromedriver` patches Selenium to bypass bot detection:
- Removes `navigator.webdriver` flag
- Patches the ChromeDriver binary at startup and drives a regular Chrome installation
- Randomizes browser fingerprints
- Avoids common detection patterns

### Installation

```bash
source .venv/bin/activate
pip install undetected-chromedriver
```

### Usage

```bash
# Run the new scraper
python agents/scraper_undetected.py
```

Or integrate it into the main scraper (quote the URL so the shell does not interpret the `?`):

```bash
python main.py scrape \
  --state AL \
  --municipality "Tuscaloosa City Schools" \
  --url "http://simbli.eboardsolutions.com/index.aspx?s=2088" \
  --platform eboard \
  --use-undetected \
  --max-events 0
```

### Pros
- ✅ Free
- ✅ No external services required
- ✅ Works for most Incapsula sites
- ✅ Easy to implement

### Cons
- ❌ May still fail on very strict Incapsula settings
- ❌ Requires GUI environment (can't run headless on some systems)
- ❌ Slower than Playwright

---

## Option 2: Residential Proxies (Best Success Rate)

### Why It Works
Incapsula detects datacenter IPs. Residential proxies route through real home IPs that appear legitimate.

### Recommended Providers

**BrightData (formerly Luminati)**
- Cost: ~$15/GB or $500/month unlimited
- Success rate: 95%+
- Rotating residential IPs
- https://brightdata.com

**SmartProxy**
- Cost: $75/month for 5GB
- Easy to use
- Good for small projects
- https://smartproxy.com

**Oxylabs**
- Cost: $15/GB
- Enterprise-grade
- https://oxylabs.io
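
Because these providers bill by traffic, it helps to estimate bandwidth before picking a plan. The sketch below is only arithmetic: the page count and average page size are assumptions you would measure for your own runs, and `estimate_monthly_proxy_cost` is a hypothetical helper, not part of any provider's SDK. It treats 1 GB as 1000 MB.

```python
def estimate_monthly_proxy_cost(
    pages_per_month: int,
    avg_page_mb: float,
    price_per_gb: float,
) -> float:
    """Estimate the monthly bandwidth cost in dollars for per-GB proxy pricing."""
    if pages_per_month < 0 or avg_page_mb < 0 or price_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    # Total traffic in GB, assuming 1 GB = 1000 MB
    total_gb = pages_per_month * avg_page_mb / 1000
    return round(total_gb * price_per_gb, 2)


# Example: 5,000 pages at an assumed 2 MB each, priced at $15/GB
cost = estimate_monthly_proxy_cost(5000, 2.0, 15.0)  # 10 GB -> 150.0
```

Rendered pages pull scripts, styles, and images through the proxy as well, so measure real transfer sizes rather than relying on the HTML size alone.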

### Implementation

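```python
# Install first (shell): pip install playwright && playwright install chromium
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        # Route all browser traffic through the residential proxy
        browser = await p.chromium.launch(
            proxy={
                'server': 'http://proxy.smartproxy.com:10000',
                'username': 'your_username',
                'password': 'your_password'
            }
        )
        # ... rest of scraping code
        await browser.close()


asyncio.run(main())
```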

### Add to agents/scraper.py

```python
# In _scrape_eboard method, add:
import os

proxy_config = None
if os.getenv('RESIDENTIAL_PROXY_URL'):
    proxy_config = {
        'server': os.getenv('RESIDENTIAL_PROXY_URL'),
        'username': os.getenv('PROXY_USERNAME'),
        'password': os.getenv('PROXY_PASSWORD')
    }

browser = await p.chromium.launch(
    proxy=proxy_config,
    headless=True
)
```

### .env Configuration

```bash
# Add to .env file
RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
```
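
One way to turn these variables into a proxy setting is a small helper that returns `None` when no proxy is configured, so the same launch call works with and without a proxy. This is a minimal sketch: `load_proxy_config` is a hypothetical name, and the dictionary keys follow the `server`/`username`/`password` shape used in the snippets above.

```python
import os
from typing import Mapping, Optional


def load_proxy_config(env: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Build a proxy dict from environment variables, or None if unset."""
    env = os.environ if env is None else env
    server = (env.get("RESIDENTIAL_PROXY_URL") or "").strip()
    if not server:
        # No proxy configured: the caller can pass None straight through
        return None

    config = {"server": server}
    # Only include credentials that are actually set; some proxies
    # authenticate by IP allow-list and need no username or password.
    username = (env.get("PROXY_USERNAME") or "").strip()
    password = env.get("PROXY_PASSWORD") or ""
    if username:
        config["username"] = username
    if password:
        config["password"] = password
    return config
```

Accepting the environment as a parameter keeps the function easy to unit test without touching the real process environment.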

### Pros
- ✅ Highest success rate (95%+)
- ✅ Works on any Incapsula configuration
- ✅ Can run headless
- ✅ Fast and reliable

### Cons
- ❌ Costs money ($10-50/month for small projects)
- ❌ Requires account setup
- ❌ May have usage limits

---

## Option 3: Browser Automation Services (Easiest)

### Why It Works
These services run real browsers in the cloud and handle all anti-bot evasion automatically.

### Recommended Services

**Browserless.io**
- Cost: $40/month for 20 hours
- Managed Playwright/Puppeteer
- Built-in proxy rotation
- https://browserless.io

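```python
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        # This endpoint form speaks the Chrome DevTools Protocol,
        # so attach with connect_over_cdp rather than connect
        browser = await p.chromium.connect_over_cdp(
            'wss://chrome.browserless.io?token=YOUR_TOKEN'
        )
        page = await browser.new_page()
        await page.goto('https://simbli.eboardsolutions.com/...')


asyncio.run(main())
```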

**ScrapingBee**
- Cost: $49/month for 100k credits
- Handles all anti-bot automatically
- Simple REST API
- https://scrapingbee.com

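```python
import requests

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'https://simbli.eboardsolutions.com/...',
        'render_js': 'true',
        'premium_proxy': 'true'
    },
    timeout=120  # rendered pages behind premium proxies can be slow
)
response.raise_for_status()  # fail loudly on quota or block errors
content = response.text
```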

**Apify**
- Cost: $49/month
- Pre-built scrapers for common sites
- Can create custom scrapers
- https://apify.com

### Pros
- ✅ Fully managed (no maintenance)
- ✅ Very high success rate
- ✅ Handles updates to anti-bot automatically
- ✅ Can scale easily

### Cons
- ❌ Most expensive option
- ❌ Requires external service dependency
- ❌ May have rate limits

---

## Option 4: Captcha Solving Service

### Why It Works
If Incapsula shows a CAPTCHA, these services solve it automatically using AI or human workers.
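
Since a solve is only worth paying for when a challenge was actually served, the scraper first needs to tell a block page from real content. The sketch below checks the HTML for two marker strings. Treat both markers as assumptions based on commonly seen Incapsula block pages, and verify them against the responses your own scraper receives; `looks_like_incapsula_block` is a hypothetical helper name.

```python
# Assumed markers: a script path and an error banner that often appear
# on Incapsula challenge/block pages. Confirm against real responses.
INCAPSULA_MARKERS = (
    "_incapsula_resource",
    "incapsula incident id",
)


def looks_like_incapsula_block(html: str) -> bool:
    """Return True if the page body contains a known challenge marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in INCAPSULA_MARKERS)
```

A check like this also gives a concrete definition of "failure" when measuring success rates for the other options.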

### Recommended Services

**2Captcha**
- Cost: $2.99 per 1000 CAPTCHAs
- Supports reCAPTCHA, hCaptcha, Incapsula
- https://2captcha.com

**Anti-Captcha**
- Cost: $2 per 1000 CAPTCHAs
- Fast (10-30 seconds)
- https://anti-captcha.com

### Implementation

```bash
pip install 2captcha-python
```

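```python
import logging
import os

from twocaptcha import TwoCaptcha

logger = logging.getLogger(__name__)

# Shell variable names cannot start with a digit, hence TWOCAPTCHA_
solver = TwoCaptcha(os.getenv('TWOCAPTCHA_API_KEY'))


async def solve_captcha(page):
    """Call when Incapsula shows a reCAPTCHA on `page` (a Playwright page)."""
    try:
        # Note: solver.recaptcha() blocks until the service returns a token
        result = solver.recaptcha(
            sitekey='SITE_KEY_FROM_PAGE',
            url='https://simbli.eboardsolutions.com/...'
        )

        # Inject the token; passing it as an argument avoids quoting bugs
        await page.evaluate(
            'token => { document.getElementById("g-recaptcha-response").innerHTML = token; }',
            result['code']
        )
        await page.click('button[type="submit"]')

    except Exception as e:
        logger.error(f"CAPTCHA solving failed: {e}")
```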

### Pros
- ✅ Solves CAPTCHAs automatically
- ✅ Relatively cheap
- ✅ Works with existing scraper

### Cons
- ❌ Only useful if CAPTCHA appears
- ❌ Slower (10-30 seconds per solve)
- ❌ Not 100% success rate
- ❌ Costs money per use

---

## Option 5: Reverse Engineer the API

### Why It Works
eBoard likely has backend APIs that mobile apps or internal tools use. These APIs may have weaker protection.

### How to Find APIs

1. **Use browser DevTools**:
   ```bash
   # Open eBoard site in Chrome
   # Press F12 → Network tab
   # Look for XHR/Fetch requests
   # Check requests to:
   #   - /api/
   #   - .ashx files
   #   - .asmx files (SOAP endpoints)
   ```

2. **Check for mobile app**:
   - Search App Store / Google Play for "eBoard Solutions"
   - Decompile APK to find API endpoints
   - Use mitmproxy to intercept app traffic

3. **Look for GraphQL/REST endpoints**:
   ```bash
   curl -I https://simbli.eboardsolutions.com/api/meetings
   curl -I https://simbli.eboardsolutions.com/graphql
   ```

### Example (if API exists)

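```python
import asyncio

import httpx


async def fetch_meetings():
    # Hypothetical API endpoint; confirm it exists in DevTools first
    async with httpx.AsyncClient() as client:
        response = await client.get(
            'https://simbli.eboardsolutions.com/api/v1/meetings',
            params={'school_id': 2088},
            headers={'User-Agent': 'eBoard-Mobile/1.0'}
        )
        response.raise_for_status()
        return response.json()


meetings = asyncio.run(fetch_meetings())
```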

### Pros
- ✅ Fastest option
- ✅ Weaker bot detection, if any
- ✅ Free
- ✅ Most reliable once a working endpoint is found

### Cons
- ❌ Requires reverse engineering skills
- ❌ API may not exist
- ❌ API may require authentication
- ❌ May violate Terms of Service

---

## Recommended Approach

### For Personal/Research Projects (Free)
**Start with Option 1 (Undetected ChromeDriver)**

```bash
# Install
pip install undetected-chromedriver

# Run test
python agents/scraper_undetected.py
```

If that fails, use **manual cookies** (the current approach) as a fallback.

### For Production/Reliable Scraping ($)
**Use Option 2 (Residential Proxies)**

Budget: ~$15-75/month depending on volume

Best provider for this use case: **SmartProxy** ($75/month for 5GB)

```bash
# Sign up at smartproxy.com
# Add credentials to .env
# Enable proxy in scraper

RESIDENTIAL_PROXY_URL=http://proxy.smartproxy.com:10000
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
```

### For Large Scale / Enterprise
**Use Option 3 (Browserless.io or ScrapingBee)**

Budget: $40-100/month

Most reliable, fully managed solution.
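
The tiers above can also be chained so that the cheapest method is tried first and the paid ones only run when it fails. The following is a minimal sketch of that idea: `scrape_with_fallback` and the `(name, fetch)` pairs are hypothetical, and each `fetch` stands in for one of the options in this guide, returning page HTML or raising on failure.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A method is a label plus a callable that takes a URL and returns HTML
Method = Tuple[str, Callable[[str], Optional[str]]]


def scrape_with_fallback(url: str, methods: Sequence[Method]) -> Tuple[str, str]:
    """Try each method in order; return (method_name, html) for the first success."""
    errors: List[str] = []
    for name, fetch in methods:
        try:
            html = fetch(url)
        except Exception as exc:
            # Record the failure and move on to the next, costlier method
            errors.append(f"{name}: {exc}")
            continue
        if html:
            return name, html
        errors.append(f"{name}: empty response")
    raise RuntimeError("all methods failed: " + "; ".join(errors))
```

Returning the method name alongside the HTML makes it easy to log which tier actually succeeded and to track how often the paid fallbacks are needed.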

---

## Implementation Plan

### Phase 1: Try Free Options
1. ✅ Install undetected-chromedriver
2. ✅ Test on Tuscaloosa City Schools
3. ✅ Measure success rate over 10 runs
4. If success rate > 80%, use this going forward
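
The measurement in steps 3 and 4 can be sketched as a small harness. Everything here is an assumption for illustration: `attempt` is a hypothetical callable that runs one scrape and returns `True` on success, and the 80% threshold is applied as a strict "greater than", matching the wording above.

```python
from typing import Callable


def measure_success_rate(attempt: Callable[[], bool], runs: int = 10) -> float:
    """Run `attempt` `runs` times and return the fraction that succeeded."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    successes = 0
    for _ in range(runs):
        try:
            if attempt():
                successes += 1
        except Exception:
            # An exception counts as a failed run rather than aborting the test
            pass
    return successes / runs


def should_adopt(rate: float, threshold: float = 0.80) -> bool:
    """Adopt the method only if the rate is strictly above the threshold."""
    return rate > threshold
```

Ten runs give a coarse estimate only, so a result close to the threshold is worth repeating with more runs before committing to a method.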

### Phase 2: Add Proxy Support (If Phase 1 Fails)
1. Add proxy configuration to existing Playwright scraper
2. Sign up for SmartProxy trial
3. Test with residential proxy
4. If successful, add to production

### Phase 3: Optimize
1. Add retry logic with exponential backoff
2. Rotate between different methods
3. Cache successful cookies for reuse
4. Monitor success rate and adjust
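
Step 1 can be sketched with the standard library alone. In this sketch `retry_with_backoff` is a hypothetical helper; the delay doubles after each failure up to `max_delay`, with up to 10% random jitter added so repeated runs do not retry in lockstep. The `sleep` and `jitter` parameters exist only so the timing can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Call `operation` until it succeeds, waiting longer after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller
                raise
            # Exponential delay: base, 2x base, 4x base, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (1 + 0.1 * jitter()))
    raise AssertionError("unreachable")  # loop always returns or raises
```

With the defaults, a scrape that fails twice and then succeeds waits roughly 1 second and then 2 seconds before the third attempt.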

---

## Next Steps

Possible follow-ups:

1. **Integrate undetected-chromedriver into the main scraper** (1-click solution)
2. **Add residential proxy support** to existing code (requires proxy account)
3. **Try to reverse engineer the eBoard API** (advanced, may take time)
4. **Create a hybrid approach** that tries multiple methods automatically