File size: 9,504 Bytes
87facc2 dec30a8 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 | # Spiders sessions
!!! success "Prerequisites"
1. You've read the [Getting started](getting-started.md) page and know how to create and run a basic spider.
2. You're familiar with [Fetchers basics](../fetching/choosing.md) and the differences between HTTP, Dynamic, and Stealthy sessions.
A spider can use multiple fetcher sessions simultaneously — for example, a fast HTTP session for simple pages and a stealth browser session for protected pages. This page shows you how to configure and use sessions.
## What are Sessions?
As you should already know, a session is a pre-configured fetcher instance that stays alive for the duration of the crawl. Instead of creating a new connection or browser for every request, the spider reuses sessions, which is faster and more resource-efficient.
By default, every spider creates a single [FetcherSession](../fetching/static.md). You can add more sessions or swap the default by overriding the `configure_sessions()` method, but you have to use the async version of each session only, as the table shows below:
| Session Type | Use Case |
|-------------------------------------------------|------------------------------------------|
| [FetcherSession](../fetching/static.md) | Fast HTTP requests, no JavaScript |
| [AsyncDynamicSession](../fetching/dynamic.md) | Browser automation, JavaScript rendering |
| [AsyncStealthySession](../fetching/stealthy.md) | Anti-bot bypass, Cloudflare, etc. |
## Configuring Sessions
Override `configure_sessions()` on your spider to set up sessions. The `manager` parameter is a `SessionManager` instance — use `manager.add()` to register sessions:
```python
from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession
class MySpider(Spider):
name = "my_spider"
start_urls = ["https://example.com"]
def configure_sessions(self, manager):
manager.add("default", FetcherSession())
async def parse(self, response: Response):
yield {"title": response.css("title::text").get("")}
```
The `manager.add()` method takes:
| Argument | Type | Default | Description |
|--------------|-----------|------------|----------------------------------------------|
| `session_id` | `str` | *required* | A name to reference this session in requests |
| `session` | `Session` | *required* | The session instance |
| `default` | `bool` | `False` | Make this the default session |
| `lazy` | `bool` | `False` | Start the session only when first used |
!!! note "Notes:"
1. In all requests, if you don't specify which session to use, the default session is used. The default session is determined in one of two ways:
1. The first session you add to the managed becomes the default automatically.
2. The session that gets `default=True` while added to the manager.
2. The instances you pass of each session don't have to be already started by you; the spider checks on all sessions if they are not already started and starts them.
3. If you want a specific session to start when used only, then use the `lazy` argument while adding that session to the manager. Example: start the browser only when you need it, not with the spider start.
## Multi-Session Spider
Here's a practical example: use a fast HTTP session for listing pages and a stealth browser for detail pages that have bot protection:
```python
from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class ProductSpider(Spider):
name = "products"
start_urls = ["https://shop.example.com/products"]
def configure_sessions(self, manager):
# Fast HTTP for listing pages (default)
manager.add("http", FetcherSession())
# Stealth browser for protected product pages
manager.add("stealth", AsyncStealthySession(
headless=True,
network_idle=True,
))
async def parse(self, response: Response):
for link in response.css("a.product::attr(href)").getall():
# Route product pages through the stealth session
yield response.follow(link, sid="stealth", callback=self.parse_product)
next_page = response.css("a.next::attr(href)").get()
if next_page:
yield response.follow(next_page)
async def parse_product(self, response: Response):
yield {
"name": response.css("h1::text").get(""),
"price": response.css(".price::text").get(""),
}
```
The key is the `sid` parameter — it tells the spider which session to use for each request. When you call `response.follow()` without `sid`, the session ID from the original request is inherited.
Note that the sessions don't have to be from different classes only, but can be the same session, but different instances with different configurations, for example, like below:
```python
from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession
class ProductSpider(Spider):
name = "products"
start_urls = ["https://shop.example.com/products"]
def configure_sessions(self, manager):
chrome_requests = FetcherSession(impersonate="chrome")
firefox_requests = FetcherSession(impersonate="firefox")
manager.add("chrome", chrome_requests)
manager.add("firefox", firefox_requests)
async def parse(self, response: Response):
for link in response.css("a.product::attr(href)").getall():
yield response.follow(link, callback=self.parse_product)
next_page = response.css("a.next::attr(href)").get()
if next_page:
yield response.follow(next_page, sid="firefox")
async def parse_product(self, response: Response):
yield {
"name": response.css("h1::text").get(""),
"price": response.css(".price::text").get(""),
}
```
Or you can separate concerns and keep a session with its cookies/state for specific requests, etc...
## Session Arguments
Extra keyword arguments passed to a `Request` (or through `response.follow(**kwargs)`) are forwarded to the session's fetch method. This lets you customize individual requests without changing the session configuration:
```python
async def parse(self, response: Response):
# Pass extra headers for this specific request
yield Request(
"https://api.example.com/data",
headers={"Authorization": "Bearer token123"},
callback=self.parse_api,
)
# Use a different HTTP method
yield Request(
"https://example.com/submit",
method="POST",
data={"field": "value"},
sid="firefox",
callback=self.parse_result,
)
```
!!! warning
Normally, when you use `FetcherSession`, `Fetcher`, or `AsyncFetcher`, you specify the HTTP method to use with the corresponding method like `.get()` and `.post()`. But while using `FetcherSession` in spiders, you can't do this. By default, the request is an _HTTP GET_ request; if you want to use another HTTP method, you have to pass it to the `method` argument, as in the above example. The reason for this is to unify the `Request` interface across all session types.
For browser sessions (`AsyncDynamicSession`, `AsyncStealthySession`), you can pass browser-specific arguments like `wait_selector`, `page_action`, or `extra_headers`:
```python
async def parse(self, response: Response):
# Use Cloudflare solver with the `AsyncStealthySession` we configured above
yield Request(
"https://nopecha.com/demo/cloudflare",
sid="stealth",
callback=self.parse_result,
solve_cloudflare=True,
block_webrtc=True,
hide_canvas=True,
google_search=True,
)
yield response.follow(
"/dynamic-page",
sid="browser",
callback=self.parse_dynamic,
wait_selector="div.loaded",
network_idle=True,
)
```
!!! warning
Session arguments (**kwargs) passed from the original request are inherited by `response.follow()`. New kwargs take precedence over inherited ones.
```python
from scrapling.spiders import Spider, Response
from scrapling.fetchers import FetcherSession
class ProductSpider(Spider):
name = "products"
start_urls = ["https://shop.example.com/products"]
def configure_sessions(self, manager):
manager.add("http", FetcherSession(impersonate='chrome'))
async def parse(self, response: Response):
# I don't want the follow request to impersonate a desktop Chrome like the previous request, but a mobile one
# so I override it like this
for link in response.css("a.product::attr(href)").getall():
yield response.follow(link, impersonate="chrome131_android", callback=self.parse_product)
next_page = response.css("a.next::attr(href)").get()
if next_page:
yield Request(next_page)
async def parse_product(self, response: Response):
yield {
"name": response.css("h1::text").get(""),
"price": response.css(".price::text").get(""),
}
```
!!! info
No need to mention that, upon spider closure, the manager automatically checks whether any sessions are still running and closes them before closing the spider. |