# Proxy management and handling Blocks
## Introduction
!!! success "Prerequisites"

    1. You've read the [Getting started](getting-started.md) page and know how to create and run a basic spider.
    2. You've read the [Sessions](sessions.md) page and understand how to configure sessions.
When scraping at scale, you'll often need to rotate through multiple proxies to avoid rate limits and blocks. Scrapling's `ProxyRotator` makes this straightforward — it works with all session types and integrates with the spider's blocked request retry system.
If you don't know what a proxy is or how to choose a good one, [this guide can help](https://substack.thewebscraping.club/p/everything-about-proxies).
## ProxyRotator
The `ProxyRotator` class manages a list of proxies and rotates through them automatically. Pass it to any session type via the `proxy_rotator` parameter:
```python
from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession, ProxyRotator


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    def configure_sessions(self, manager):
        rotator = ProxyRotator([
            "http://proxy1:8080",
            "http://proxy2:8080",
            "http://user:pass@proxy3:8080",
        ])
        manager.add("default", FetcherSession(proxy_rotator=rotator))

    async def parse(self, response: Response):
        # Check which proxy was used
        print(f"Proxy used: {response.meta.get('proxy')}")
        yield {"title": response.css("title::text").get("")}
```
Each request automatically gets the next proxy in the rotation. The proxy used is stored in `response.meta["proxy"]` so you can track which proxy fetched which page.
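Because the proxy is recorded on every response, it is easy to keep per-proxy statistics. The following is a minimal sketch and not part of Scrapling: `ProxyStats` is a hypothetical helper, and it assumes the value stored in `response.meta["proxy"]` can be turned into a usable key with `str()`.

```python
from collections import Counter
from typing import Any, Optional


class ProxyStats:
    """Hypothetical helper that counts responses and error statuses per proxy."""

    def __init__(self) -> None:
        self.total: Counter = Counter()
        self.errors: Counter = Counter()

    def record(self, proxy: Optional[Any], status: int) -> None:
        # str() gives a hashable key even if the proxy is a dict
        key = str(proxy) if proxy else "<direct>"
        self.total[key] += 1
        if status >= 400:
            self.errors[key] += 1

    def error_rate(self, proxy: Any) -> float:
        key = str(proxy)
        return self.errors[key] / self.total[key] if self.total[key] else 0.0


stats = ProxyStats()
stats.record("http://proxy1:8080", 200)
stats.record("http://proxy1:8080", 403)
stats.record("http://proxy2:8080", 200)
print(stats.error_rate("http://proxy1:8080"))
```

In a spider, you could call something like `stats.record(response.meta.get("proxy"), response.status)` wherever you see the response, for example in `parse()`.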
Browser sessions accept a rotator as well, and additionally support Playwright-style dict proxies:
```python
from scrapling.fetchers import AsyncDynamicSession, AsyncStealthySession, ProxyRotator

# String proxies work for all session types
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

# Dict proxies (Playwright format) work for browser sessions
rotator = ProxyRotator([
    {"server": "http://proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2:8080"},
])

# Then inside the spider
def configure_sessions(self, manager):
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    manager.add("browser", AsyncStealthySession(proxy_rotator=rotator))
```
!!! info

    1. You cannot use the `proxy_rotator` argument together with the static `proxy` or `proxies` parameters on the same session. Pick one approach when configuring the session; you can still override it for individual requests, as described in the Per-Request Proxy Override section.
    2. Remember that by default, all browser-based sessions use a persistent browser context with a pool of tabs. However, since browsers can't set a proxy per tab, when you use a `ProxyRotator`, the fetcher will automatically open a separate context for each proxy, with one tab per context. Once the tab's job is done, both the tab and its context are closed.
## Custom Rotation Strategies
By default, `ProxyRotator` uses cyclic rotation: it iterates through the proxies sequentially and wraps around at the end.
You can provide a custom strategy function to change this behavior, but it must match the following signature:
```python
from scrapling.core._types import ProxyType


def my_strategy(proxies: list, current_index: int) -> tuple[ProxyType, int]:
    ...
```
It receives the list of proxies and the current index, and must return the chosen proxy and the next index.
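For reference, cyclic behavior can be written against the same signature. This is a minimal sketch, not Scrapling's actual implementation: `cyclic_strategy` and the `drive` helper are hypothetical names, and `drive` only assumes that the rotator feeds each returned index back into the next call.

```python
from typing import Callable, List, Tuple

Proxy = str  # Simplification for this sketch; dict proxies would work the same way
Strategy = Callable[[List[Proxy], int], Tuple[Proxy, int]]


def cyclic_strategy(proxies: List[Proxy], current_index: int) -> Tuple[Proxy, int]:
    """Return the proxy at current_index and the index to use next, wrapping around."""
    idx = current_index % len(proxies)
    return proxies[idx], (idx + 1) % len(proxies)


def drive(strategy: Strategy, proxies: List[Proxy], calls: int) -> List[Proxy]:
    """Call the strategy repeatedly, feeding each returned index into the next call.

    Assumption: this mirrors how a rotator would use the strategy.
    """
    index, chosen = 0, []
    for _ in range(calls):
        proxy, index = strategy(proxies, index)
        chosen.append(proxy)
    return chosen


picked = drive(cyclic_strategy, ["http://proxy1:8080", "http://proxy2:8080"], 3)
print(picked)
```

A small driver like this is also a convenient way to check a custom strategy offline before handing it to a rotator.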
Below are some examples of custom rotation strategies you can use.
### Random Rotation
```python
import random

from scrapling.fetchers import ProxyRotator


def random_strategy(proxies, current_index):
    idx = random.randint(0, len(proxies) - 1)
    return proxies[idx], idx


rotator = ProxyRotator(
    ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"],
    strategy=random_strategy,
)
```
### Weighted Rotation
```python
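import random

from scrapling.fetchers import ProxyRotator


def weighted_strategy(proxies, current_index):
    # With a single proxy there is nothing to weigh (and no division by zero below)
    if len(proxies) == 1:
        return proxies[0], current_index
    # First proxy gets 60% of traffic, others split the remaining 40% evenly
    others = len(proxies) - 1
    weights = [60] + [40 / others] * others
    proxy = random.choices(proxies, weights=weights, k=1)[0]
    return proxy, current_index  # Index doesn't matter for weighted


proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]
rotator = ProxyRotator(proxies, strategy=weighted_strategy)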
```
## Per-Request Proxy Override
You can override the rotator for individual requests by passing `proxy=` as a keyword argument:
```python
async def parse(self, response: Response):
    # This request uses the rotator's next proxy
    yield response.follow("/page1", callback=self.parse_page)

    # This request uses a specific proxy, bypassing the rotator
    yield response.follow(
        "/special-page",
        callback=self.parse_page,
        proxy="http://special-proxy:8080",
    )
```
This is useful when certain pages require a specific proxy (e.g., a geo-located proxy for region-specific content).
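One way to organize such overrides is a small lookup that maps URL paths to region-specific proxies. The sketch below is only an illustration: the `GEO_PROXIES` table, the proxy URLs, and the `proxy_for` and `follow_kwargs` helpers are all hypothetical. It omits the `proxy` keyword entirely when no override applies, so those requests keep using the rotator.

```python
from typing import Dict, Optional
from urllib.parse import urlparse

# Hypothetical mapping from URL path prefix to a region-specific proxy
GEO_PROXIES: Dict[str, str] = {
    "/de/": "http://de-proxy:8080",
    "/jp/": "http://jp-proxy:8080",
}


def proxy_for(url: str) -> Optional[str]:
    """Return the override proxy for this URL, or None to leave the rotator in charge."""
    path = urlparse(url).path
    for prefix, proxy in GEO_PROXIES.items():
        if path.startswith(prefix):
            return proxy
    return None


def follow_kwargs(url: str) -> Dict[str, str]:
    """Build extra keyword arguments for a follow-up request to this URL."""
    proxy = proxy_for(url)
    return {"proxy": proxy} if proxy else {}


print(follow_kwargs("/de/products"))
print(follow_kwargs("/page1"))
```

Inside `parse()`, this could be used as `yield response.follow(url, callback=self.parse_page, **follow_kwargs(url))`.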
## Blocked Request Handling
The spider has built-in blocked request detection and retry. By default, it considers the following HTTP status codes blocked: `401`, `403`, `407`, `429`, `444`, `500`, `502`, `503`, `504`.
The retry system works like this:
1. After a response comes back, the spider calls the `is_blocked(response)` method.
2. If blocked, it copies the request and calls the `retry_blocked_request()` method so you can modify it before retrying.
3. The retried request is re-queued with `dont_filter=True` (bypassing deduplication) and lower priority, so it's not retried right away.
4. This repeats up to `max_blocked_retries` times (default: 3).
!!! tip

    1. On retry, the previous `proxy`/`proxies` kwargs are cleared from the request automatically, so the rotator assigns a fresh proxy.
    2. The `max_blocked_retries` limit is separate from the session-level retries and does not share their counter.
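The retry flow described above can be modeled in a few lines of plain Python. This is a toy simulation under stated assumptions, not Scrapling's scheduler: `ToyRequest`, `crawl`, and `flaky_fetch` are hypothetical names, the queue treats a larger `priority` number as "fetch later" (Scrapling's own priority semantics may differ), and a request is assumed to get one initial attempt plus up to three retries.

```python
import heapq
from dataclasses import dataclass, field, replace
from typing import Callable, Iterable, List, Tuple

# Default blocked status codes, as listed above
BLOCKED_CODES = {401, 403, 407, 429, 444, 500, 502, 503, 504}
MAX_BLOCKED_RETRIES = 3


@dataclass(order=True)
class ToyRequest:
    priority: int  # in this toy queue, a larger number is fetched later
    url: str = field(compare=False)
    retries: int = field(default=0, compare=False)


def crawl(
    start_urls: Iterable[str],
    fetch: Callable[[ToyRequest], int],
    on_retry: Callable[[ToyRequest], ToyRequest] = lambda r: r,
) -> Tuple[List[str], List[str]]:
    """Fetch every URL, re-queueing blocked requests until the retry limit is hit."""
    queue = [ToyRequest(0, url) for url in start_urls]
    heapq.heapify(queue)
    done, dropped = [], []
    while queue:
        request = heapq.heappop(queue)
        status = fetch(request)
        if status not in BLOCKED_CODES:  # step 1: block check
            done.append(request.url)
        elif request.retries < MAX_BLOCKED_RETRIES:  # step 4: retry limit
            # Step 2: copy the request and let the hook modify the copy
            retry = on_retry(replace(request, retries=request.retries + 1))
            # Step 3: re-queue behind other work, with no deduplication check
            retry.priority += 1
            heapq.heappush(queue, retry)
        else:
            dropped.append(request.url)
    return done, dropped


attempts = {}


def flaky_fetch(request: ToyRequest) -> int:
    """Simulated fetch: one URL is always blocked, the other succeeds on attempt 3."""
    attempts[request.url] = attempts.get(request.url, 0) + 1
    if request.url.endswith("/always-blocked"):
        return 403
    return 200 if attempts[request.url] >= 3 else 429


done, dropped = crawl(
    ["https://example.com/a", "https://example.com/always-blocked"], flaky_fetch
)
print(done, dropped, attempts)
```

In this simulation the always-blocked URL is fetched four times (one initial attempt plus three retries) before it is dropped, while the other URL succeeds on its second retry.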
### Custom Block Detection
Override `is_blocked()` to add your own detection logic:
```python
from scrapling.spiders import Spider, Response


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    async def is_blocked(self, response: Response) -> bool:
        # Treat these status codes as blocked (narrower than the default list)
        if response.status in {403, 429, 503}:
            return True

        # Check response content for common block messages
        body = response.body.decode("utf-8", errors="ignore").lower()
        if "access denied" in body or "rate limit" in body:
            return True

        return False

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
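If the detection rules grow, it can help to keep them in a plain function that is testable without running a spider. The sketch below is one possible arrangement, not something Scrapling provides: `looks_blocked`, `BLOCK_STATUSES`, and `BLOCK_MARKERS` are hypothetical names, and the body is assumed to arrive as bytes, as in the example above.

```python
from typing import FrozenSet, Tuple

BLOCK_STATUSES: FrozenSet[int] = frozenset({403, 429, 503})
BLOCK_MARKERS: Tuple[str, ...] = ("access denied", "rate limit")


def looks_blocked(
    status: int,
    body: bytes,
    statuses: FrozenSet[int] = BLOCK_STATUSES,
    markers: Tuple[str, ...] = BLOCK_MARKERS,
) -> bool:
    """Return True if the status code or the page text suggests a block."""
    if status in statuses:
        return True
    text = body.decode("utf-8", errors="ignore").lower()
    return any(marker in text for marker in markers)


print(looks_blocked(200, b"<h1>Access Denied</h1>"))
print(looks_blocked(200, b"<title>Home</title>"))
```

An `is_blocked()` override could then reduce to `return looks_blocked(response.status, response.body)`.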
### Customizing Retries
Override `retry_blocked_request()` to modify the request before retrying. The `max_blocked_retries` attribute controls how many times a blocked request is retried (default: 3):
```python
from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari']))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
The example above leaves the block detection logic unchanged: the spider uses the plain `requests` session until a request is blocked, then retries that request with the stealthy browser session.
Putting it all together:
```python
from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession, ProxyRotator

cheap_proxies = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])

# A format acceptable by the browser
expensive_proxies = ProxyRotator([
    {"server": "http://residential_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://residential_proxy2:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy2:8080", "username": "user", "password": "pass"},
])


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari'], proxy_rotator=cheap_proxies))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True, proxy_rotator=expensive_proxies), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
The logic above is: requests go through cheap proxies, such as datacenter proxies, until they are blocked, and are then retried with higher-quality proxies, such as residential or mobile proxies.