# Proxy Management and Handling Blocks

## Introduction

!!! success "Prerequisites"

    1. You've read the [Getting started](getting-started.md) page and know how to create and run a basic spider.
    2. You've read the [Sessions](sessions.md) page and understand how to configure sessions.

When scraping at scale, you'll often need to rotate through multiple proxies to avoid rate limits and blocks. Scrapling's `ProxyRotator` makes this straightforward: it works with all session types and integrates with the spider's blocked-request retry system.

If you don't know what a proxy is or how to choose a good one, [this guide can help](https://substack.thewebscraping.club/p/everything-about-proxies).

## ProxyRotator

The `ProxyRotator` class manages a list of proxies and rotates through them automatically. Pass it to any session type via the `proxy_rotator` parameter:

```python
from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession, ProxyRotator


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    def configure_sessions(self, manager):
        rotator = ProxyRotator([
            "http://proxy1:8080",
            "http://proxy2:8080",
            "http://user:pass@proxy3:8080",
        ])
        manager.add("default", FetcherSession(proxy_rotator=rotator))

    async def parse(self, response: Response):
        # Check which proxy was used
        print(f"Proxy used: {response.meta.get('proxy')}")
        yield {"title": response.css("title::text").get("")}
```

Each request automatically gets the next proxy in the rotation. The proxy used is stored in `response.meta["proxy"]` so you can track which proxy fetched which page.
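For example, you could use that metadata to keep simple per-proxy statistics. Here's a minimal sketch building on the spider above (it assumes the same session setup; the `proxy_counts` attribute is illustrative, not a Scrapling feature):

```python
from collections import Counter


class TrackingSpider(Spider):
    name = "tracking_spider"
    start_urls = ["https://example.com"]
    proxy_counts = Counter()  # Illustrative helper, not a Scrapling attribute

    async def parse(self, response: Response):
        # Tally how many pages each proxy fetched
        self.proxy_counts[response.meta.get("proxy")] += 1
        yield {"title": response.css("title::text").get("")}
```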
When using it with browser sessions, a few adjustments are needed, as shown below:

```python
from scrapling.fetchers import AsyncDynamicSession, AsyncStealthySession, ProxyRotator

# String proxies work for all session types
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

# Dict proxies (Playwright format) work for browser sessions
rotator = ProxyRotator([
    {"server": "http://proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2:8080"},
])


# Then inside the spider
def configure_sessions(self, manager):
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    manager.add("browser", AsyncStealthySession(proxy_rotator=rotator))
```

!!! info

    1. You cannot use the `proxy_rotator` argument together with the static `proxy` or `proxies` parameters on the same session. Pick one approach when configuring the session; you can still override it per request, as shown later.
    2. Remember that by default, all browser-based sessions use a persistent browser context with a pool of tabs. However, since browsers can't set a proxy per tab, when you use a `ProxyRotator`, the fetcher automatically opens a separate context for each proxy, with one tab per context. Once the tab's job is done, both the tab and its context are closed.
## Custom Rotation Strategies

By default, `ProxyRotator` uses cyclic rotation: it iterates through the proxies sequentially, wrapping around at the end.

You can provide a custom strategy function to change this behavior, but it must match the following signature:

```python
from scrapling.core._types import ProxyType


def my_strategy(proxies: list, current_index: int) -> tuple[ProxyType, int]:
    ...
```

It receives the list of proxies and the current index, and must return the chosen proxy and the next index.
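For reference, the default cyclic behavior could be expressed in this form like so (a sketch of the contract, not the library's actual internals):

```python
def cyclic_strategy(proxies, current_index):
    # Use the proxy at the current index, then advance and wrap around
    proxy = proxies[current_index % len(proxies)]
    return proxy, (current_index + 1) % len(proxies)
```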
Below are some examples of custom rotation strategies you can use.

### Random Rotation

```python
import random

from scrapling.fetchers import ProxyRotator


def random_strategy(proxies, current_index):
    idx = random.randint(0, len(proxies) - 1)
    return proxies[idx], idx


rotator = ProxyRotator(
    ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"],
    strategy=random_strategy,
)
```
### Weighted Rotation

```python
import random

from scrapling.fetchers import ProxyRotator

proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]


def weighted_strategy(proxies, current_index):
    # First proxy gets 60% of traffic, others split the rest
    weights = [60] + [40 // (len(proxies) - 1)] * (len(proxies) - 1)
    proxy = random.choices(proxies, weights=weights, k=1)[0]
    return proxy, current_index  # The index doesn't matter for weighted rotation


rotator = ProxyRotator(proxies, strategy=weighted_strategy)
```
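### Sticky Rotation

Because the strategy is just a callable, you can also hold state in a closure. Below is a sketch that sticks to each proxy for a fixed number of requests before advancing (the `make_sticky_strategy` helper is illustrative, not part of Scrapling):

```python
from scrapling.fetchers import ProxyRotator


def make_sticky_strategy(requests_per_proxy: int):
    # Closure state: how many times the current proxy has been used
    uses = {"count": 0}

    def sticky_strategy(proxies, current_index):
        uses["count"] += 1
        if uses["count"] >= requests_per_proxy:
            # This proxy's quota is used up, move to the next one
            uses["count"] = 0
            next_index = (current_index + 1) % len(proxies)
        else:
            next_index = current_index
        return proxies[current_index], next_index

    return sticky_strategy


rotator = ProxyRotator(
    ["http://proxy1:8080", "http://proxy2:8080"],
    strategy=make_sticky_strategy(10),  # Keep each proxy for 10 requests
)
```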
## Per-Request Proxy Override

You can override the rotator for individual requests by passing `proxy=` as a keyword argument:

```python
async def parse(self, response: Response):
    # This request uses the rotator's next proxy
    yield response.follow("/page1", callback=self.parse_page)

    # This request uses a specific proxy, bypassing the rotator
    yield response.follow(
        "/special-page",
        callback=self.parse_page,
        proxy="http://special-proxy:8080",
    )
```

This is useful when certain pages require a specific proxy (e.g., a geo-located proxy for region-specific content).
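For example, you could route region-locked pages through a matching proxy while everything else goes through the rotator. A sketch (the `GEO_PROXIES` mapping, paths, and proxy URLs are hypothetical):

```python
from scrapling.spiders import Spider, Response

# Hypothetical mapping of URL path prefixes to geo-located proxies
GEO_PROXIES = {
    "/de/": "http://german-proxy:8080",
    "/fr/": "http://french-proxy:8080",
}


class GeoSpider(Spider):
    name = "geo_spider"
    start_urls = ["https://example.com"]

    async def parse(self, response: Response):
        for path in ("/de/page", "/fr/page", "/page"):
            # Use a geo proxy when the path needs one, otherwise let the rotator decide
            proxy = next(
                (p for prefix, p in GEO_PROXIES.items() if path.startswith(prefix)),
                None,
            )
            if proxy:
                yield response.follow(path, callback=self.parse_page, proxy=proxy)
            else:
                yield response.follow(path, callback=self.parse_page)

    async def parse_page(self, response: Response):
        yield {"url": response.url, "proxy": response.meta.get("proxy")}
```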
## Blocked Request Handling

The spider has built-in blocked-request detection and retry. By default, it considers the following HTTP status codes blocked: `401`, `403`, `407`, `429`, `444`, `500`, `502`, `503`, `504`.

The retry system works like this:

1. After a response comes back, the spider calls the `is_blocked(response)` method.
2. If blocked, it copies the request and calls the `retry_blocked_request()` method so you can modify it before retrying.
3. The retried request is re-queued with `dont_filter=True` (bypassing deduplication) and lower priority, so it's not retried right away.
4. This repeats up to `max_blocked_retries` times (default: 3).

!!! tip

    1. On retry, the previous `proxy`/`proxies` kwargs are cleared from the request automatically, so the rotator assigns a fresh proxy.
    2. The `max_blocked_retries` attribute is separate from the session retries and doesn't share their counter.
### Custom Block Detection

Override `is_blocked()` to add your own detection logic:

```python
class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    async def is_blocked(self, response: Response) -> bool:
        # Check status codes (default behavior)
        if response.status in {403, 429, 503}:
            return True
        # Check response content
        body = response.body.decode("utf-8", errors="ignore")
        if "access denied" in body.lower() or "rate limit" in body.lower():
            return True
        return False

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
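If you only want to layer extra checks on top of the default status-code detection instead of replacing it, you could delegate to the parent class first. This sketch assumes the base `Spider.is_blocked()` is an async method, as the override's signature implies:

```python
class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    async def is_blocked(self, response: Response) -> bool:
        # Keep the built-in status-code checks
        if await super().is_blocked(response):
            return True
        # Then add a content-based check on top
        body = response.body.decode("utf-8", errors="ignore")
        return "captcha" in body.lower()
```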
### Customizing Retries

Override `retry_blocked_request()` to modify the request before retrying. The `max_blocked_retries` attribute controls how many times a blocked request is retried (default: 3):

```python
from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari']))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
In the example above, I left the block-detection logic unchanged and had the spider use plain HTTP requests until one got blocked, then switch the retried request to the stealthy browser session.
Putting it all together:

```python
from scrapling.spiders import Spider, SessionManager, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession, ProxyRotator

cheap_proxies = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])

# A format acceptable by the browser
expensive_proxies = ProxyRotator([
    {"server": "http://residential_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://residential_proxy2:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://mobile_proxy2:8080", "username": "user", "password": "pass"},
])


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    max_blocked_retries = 5

    def configure_sessions(self, manager: SessionManager) -> None:
        manager.add('requests', FetcherSession(impersonate=['chrome', 'firefox', 'safari'], proxy_rotator=cheap_proxies))
        manager.add('stealth', AsyncStealthySession(block_webrtc=True, proxy_rotator=expensive_proxies), lazy=True)

    async def retry_blocked_request(self, request: Request, response: Response) -> Request:
        request.sid = "stealth"
        self.logger.info(f"Retrying blocked request: {request.url}")
        return request

    async def parse(self, response: Response):
        yield {"title": response.css("title::text").get("")}
```
The logic above: requests go out with cheap proxies (such as datacenter proxies) until they get blocked, then are retried with higher-quality proxies, such as residential or mobile ones.