Karim shoair committed on
Commit
05de0da
·
1 Parent(s): 87facc2

docs: add a page for spiders advanced usage

docs/spiders/advanced.md ADDED
@@ -0,0 +1,313 @@
1
+ # Advanced Usage
2
+
3
+ ## Introduction
4
+
5
+ !!! success "Prerequisites"
6
+
7
+ 1. You've read the [Getting started](getting-started.md) page and know how to create and run a basic spider.
8
+
9
+ This page covers the spider system's advanced features: concurrency control, pause/resume, streaming, lifecycle hooks, statistics, and logging.
10
+
11
+ ## Concurrency Control
12
+
13
+ The spider system uses three class attributes to control how aggressively it crawls:
14
+
15
+ | Attribute | Default | Description |
16
+ |----------------------------------|---------|------------------------------------------------------------------|
17
+ | `concurrent_requests` | `4` | Maximum number of requests being processed at the same time |
18
+ | `concurrent_requests_per_domain` | `0` | Maximum concurrent requests per domain (0 = no per-domain limit) |
19
+ | `download_delay` | `0.0` | Seconds to wait before each request |
20
+
21
+ ```python
22
+ class PoliteSpider(Spider):
23
+ name = "polite"
24
+ start_urls = ["https://example.com"]
25
+
26
+ # Be gentle with the server
27
+ concurrent_requests = 4
28
+ concurrent_requests_per_domain = 2
29
+ download_delay = 1.0 # Wait 1 second between requests
30
+
31
+ async def parse(self, response: Response):
32
+ yield {"title": response.css("title::text").get("")}
33
+ ```
34
+
35
+ When `concurrent_requests_per_domain` is set, each domain gets its own concurrency limiter in addition to the global limit. This is useful when crawling multiple domains simultaneously — you can allow high global concurrency while being polite to each individual domain.
36
+
37
+ !!! tip
38
+
39
+ The `download_delay` parameter adds a fixed wait before every request, regardless of the domain. Use it for simple rate limiting.
40
+
41
+ ### Using uvloop
42
+
43
+ The `start()` method accepts a `use_uvloop` parameter to use the faster [uvloop](https://github.com/MagicStack/uvloop)/[winloop](https://github.com/nicktimko/winloop) event loop implementation, if available:
44
+
45
+ ```python
46
+ result = MySpider().start(use_uvloop=True)
47
+ ```
48
+
49
+ This can improve throughput for I/O-heavy crawls. You'll need to install `uvloop` (Linux/macOS) or `winloop` (Windows) separately.
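If you'd rather not hard-depend on either package, you can probe for it at runtime and only enable it when present. A small sketch; `fast_loop_available` is an illustrative helper, not part of the spider API:

```python
import importlib.util
import sys

def fast_loop_available() -> bool:
    """Check whether the accelerated event loop package is installed."""
    # winloop is the Windows port of uvloop
    pkg = "winloop" if sys.platform == "win32" else "uvloop"
    return importlib.util.find_spec(pkg) is not None

# Enable the faster loop only when it's actually installed:
# result = MySpider().start(use_uvloop=fast_loop_available())
```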
50
+
51
+ ## Pause & Resume
52
+
53
+ The spider supports graceful pause-and-resume via checkpointing. To enable it, pass a `crawldir` directory to the spider constructor:
54
+
55
+ ```python
56
+ spider = MySpider(crawldir="crawl_data/my_spider")
57
+ result = spider.start()
58
+
59
+ if result.paused:
60
+ print("Crawl was paused. Run again to resume.")
61
+ else:
62
+ print("Crawl completed!")
63
+ ```
64
+
65
+ ### How It Works
66
+
67
+ 1. **Pausing**: Press `Ctrl+C` during a crawl. The spider waits for all in-flight requests to finish, saves a checkpoint (pending requests + a set of seen request fingerprints), and then exits.
68
+ 2. **Force stopping**: Press `Ctrl+C` a second time to stop immediately without waiting for active tasks.
69
+ 3. **Resuming**: Run the spider again with the same `crawldir`. It detects the checkpoint, restores the queue and seen set, and continues from where it left off — skipping `start_requests()`.
70
+ 4. **Cleanup**: When a crawl completes normally (not paused), the checkpoint files are deleted automatically.
71
+
72
+ **Checkpoints are also saved periodically during the crawl (every 5 minutes by default).**
73
+
74
+ You can change the interval as follows:
75
+
76
+ ```python
77
+ # Save checkpoint every 2 minutes
78
+ spider = MySpider(crawldir="crawl_data/my_spider", interval=120.0)
79
+ ```
80
+
81
+ Checkpoint writes to disk are atomic, so an interruption in the middle of a save can never leave a corrupted checkpoint behind.
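Atomicity here is the usual write-then-rename pattern: the new data is written to a temporary file and only swapped into place once it's complete. This is a sketch of the technique, not the library's actual code:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, payload: dict) -> None:
    """Write JSON to `path` so readers never observe a partial file."""
    directory = os.path.dirname(path) or "."
    # Write the full payload to a temp file in the same directory...
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        # ...then swap it into place; os.replace is atomic on POSIX and Windows
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
```

If the process dies before the rename, the previous file is untouched; after it, the new one is fully in place.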
82
+
83
+ !!! tip
84
+
85
+ Pressing `Ctrl+C` during a crawl always closes the spider gracefully, even when the checkpoint system is not enabled. Pressing it a second time forces the spider to close immediately.
86
+
87
+ ### Knowing If You're Resuming
88
+
89
+ The `on_start()` hook receives a `resuming` flag:
90
+
91
+ ```python
92
+ async def on_start(self, resuming: bool = False):
93
+ if resuming:
94
+ self.logger.info("Resuming from checkpoint!")
95
+ else:
96
+ self.logger.info("Starting fresh crawl")
97
+ ```
98
+
99
+ ## Streaming
100
+
101
+ For long-running spiders or applications that need real-time access to scraped items, use the `stream()` method instead of `start()`:
102
+
103
+ ```python
104
+ import anyio
105
+
106
+ async def main():
107
+ spider = MySpider()
108
+ async for item in spider.stream():
109
+ print(f"Got item: {item}")
110
+ # Access real-time stats
111
+ print(f"Items so far: {spider.stats.items_scraped}")
112
+ print(f"Requests made: {spider.stats.requests_count}")
113
+
114
+ anyio.run(main)
115
+ ```
116
+
117
+ Key differences from `start()`:
118
+
119
+ - `stream()` must be called from an async context
120
+ - Items are yielded one by one as they're scraped, not collected into a list
121
+ - You can access `spider.stats` during iteration for real-time statistics
122
+
123
+ !!! abstract
124
+
125
+ The full list of stats available on `spider.stats` is covered in [Results & Statistics](#results--statistics) below.
126
+
127
+ You can combine streaming with the checkpoint system too, which makes it easy to build UIs on top of spiders: interfaces that show real-time data and can be paused and resumed.
128
+
129
+ ```python
130
+ import anyio
131
+
132
+ async def main():
133
+ spider = MySpider(crawldir="crawl_data/my_spider")
134
+ async for item in spider.stream():
135
+ print(f"Got item: {item}")
136
+ # Access real-time stats
137
+ print(f"Items so far: {spider.stats.items_scraped}")
138
+ print(f"Requests made: {spider.stats.requests_count}")
139
+
140
+ anyio.run(main)
141
+ ```
142
+ You can also call `spider.pause()` to shut down the spider in the code above. If the checkpoint system is not enabled, it simply closes the crawl.
143
+
144
+ ## Lifecycle Hooks
145
+
146
+ The spider provides several hooks you can override to add custom behavior at different stages of the crawl:
147
+
148
+ ### on_start
149
+
150
+ Called before crawling begins. Use it for setup tasks like loading data or initializing resources:
151
+
152
+ ```python
153
+ async def on_start(self, resuming: bool = False):
154
+ self.logger.info("Spider starting up")
155
+ # Load seed URLs from a database, initialize counters, etc.
156
+ ```
157
+
158
+ ### on_close
159
+
160
+ Called after crawling finishes (whether completed or paused). Use it for cleanup:
161
+
162
+ ```python
163
+ async def on_close(self):
164
+ self.logger.info("Spider shutting down")
165
+ # Close database connections, flush buffers, etc.
166
+ ```
167
+
168
+ ### on_error
169
+
170
+ Called when a request fails with an exception. Use it for error tracking or custom recovery logic:
171
+
172
+ ```python
173
+ async def on_error(self, request: Request, error: Exception):
174
+ self.logger.error(f"Failed: {request.url} - {error}")
175
+ # Log to error tracker, save failed URL for later, etc.
176
+ ```
177
+
178
+ ### on_scraped_item
179
+
180
+ Called for every scraped item before it's added to the results. Return the item (modified or not) to keep it, or return `None` to drop it:
181
+
182
+ ```python
183
+ async def on_scraped_item(self, item: dict) -> dict | None:
184
+ # Drop items without a title
185
+ if not item.get("title"):
186
+ return None
187
+
188
+ # Modify items (e.g., add timestamps)
189
+ item["scraped_at"] = "2026-01-01"
190
+ return item
191
+ ```
192
+
193
+ !!! tip
194
+
195
+ This hook can also be used to route items through your own pipelines and then drop them from the spider's results.
196
+
197
+ ### start_requests
198
+
199
+ Override `start_requests()` for custom initial request generation instead of using `start_urls`:
200
+
201
+ ```python
202
+ async def start_requests(self):
203
+ # POST request to log in first
204
+ yield Request(
205
+ "https://example.com/login",
206
+ method="POST",
207
+ data={"user": "admin", "pass": "secret"},
208
+ callback=self.after_login,
209
+ )
210
+
211
+ async def after_login(self, response: Response):
212
+ # Now crawl the authenticated pages
213
+ yield response.follow("/dashboard", callback=self.parse)
214
+ ```
215
+
216
+ ## Results & Statistics
217
+
218
+ The `CrawlResult` returned by `start()` contains both the scraped items and detailed statistics:
219
+
220
+ ```python
221
+ result = MySpider().start()
222
+
223
+ # Items
224
+ print(f"Total items: {len(result.items)}")
225
+ result.items.to_json("output.json", indent=True)
226
+
227
+ # Did the crawl complete?
228
+ print(f"Completed: {result.completed}")
229
+ print(f"Paused: {result.paused}")
230
+
231
+ # Statistics
232
+ stats = result.stats
233
+ print(f"Requests: {stats.requests_count}")
234
+ print(f"Failed: {stats.failed_requests_count}")
235
+ print(f"Blocked: {stats.blocked_requests_count}")
236
+ print(f"Offsite filtered: {stats.offsite_requests_count}")
237
+ print(f"Items scraped: {stats.items_scraped}")
238
+ print(f"Items dropped: {stats.items_dropped}")
239
+ print(f"Response bytes: {stats.response_bytes}")
240
+ print(f"Duration: {stats.elapsed_seconds:.1f}s")
241
+ print(f"Speed: {stats.requests_per_second:.1f} req/s")
242
+ ```
243
+
244
+ ### Detailed Stats
245
+
246
+ The `CrawlStats` object tracks granular information:
247
+
248
+ ```python
249
+ stats = result.stats
250
+
251
+ # Status code distribution
252
+ print(stats.response_status_count)
253
+ # {'status_200': 150, 'status_404': 3, 'status_403': 1}
254
+
255
+ # Bytes downloaded per domain
256
+ print(stats.domains_response_bytes)
257
+ # {'example.com': 1234567, 'api.example.com': 45678}
258
+
259
+ # Requests per session
260
+ print(stats.sessions_requests_count)
261
+ # {'http': 120, 'stealth': 34}
262
+
263
+ # Proxies used during the crawl
264
+ print(stats.proxies)
265
+ # ['http://proxy1:8080', 'http://proxy2:8080']
266
+
267
+ # Log level counts
268
+ print(stats.log_levels_counter)
269
+ # {'debug': 200, 'info': 50, 'warning': 3, 'error': 1, 'critical': 0}
270
+
271
+ # Timing information
272
+ print(stats.start_time) # Unix timestamp when crawl started
273
+ print(stats.end_time) # Unix timestamp when crawl finished
274
+ print(stats.download_delay) # The download delay used (seconds)
275
+
276
+ # Concurrency settings used
277
+ print(stats.concurrent_requests) # Global concurrency limit
278
+ print(stats.concurrent_requests_per_domain) # Per-domain concurrency limit
279
+
280
+ # Custom stats (set by your spider code)
281
+ print(stats.custom_stats)
282
+ # {'login_attempts': 3, 'pages_with_errors': 5}
283
+
284
+ # Export everything as a dict
285
+ print(stats.to_dict())
286
+ ```
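Because these are plain dicts, deriving your own metrics is straightforward. A small sketch that computes an error rate from a distribution shaped like `response_status_count` above (sample numbers, not real output):

```python
def error_rate(status_counts: dict[str, int]) -> float:
    """Fraction of responses that came back with a 4xx/5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(
        count
        for key, count in status_counts.items()
        # Keys look like "status_404"; the code is the part after the last underscore
        if int(key.rsplit("_", 1)[1]) >= 400
    )
    return errors / total

sample = {"status_200": 150, "status_404": 3, "status_403": 1}
print(f"Error rate: {error_rate(sample):.1%}")
```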
287
+
288
+ ## Logging
289
+
290
+ The spider has a built-in logger accessible via `self.logger`. It's pre-configured with the spider's name and supports several customization options:
291
+
292
+ | Attribute | Default | Description |
293
+ |-----------------------|--------------------------------------------------------------|----------------------------------------------------|
294
+ | `logging_level` | `logging.DEBUG` | Minimum log level |
295
+ | `logging_format` | `"[%(asctime)s]:({spider_name}) %(levelname)s: %(message)s"` | Log message format |
296
+ | `logging_date_format` | `"%Y-%m-%d %H:%M:%S"` | Date format in log messages |
297
+ | `log_file` | `None` | Path to a log file (in addition to console output) |
298
+
299
+ ```python
300
+ import logging
301
+
302
+ class MySpider(Spider):
303
+ name = "my_spider"
304
+ start_urls = ["https://example.com"]
305
+ logging_level = logging.INFO
306
+ log_file = "logs/my_spider.log"
307
+
308
+ async def parse(self, response: Response):
309
+ self.logger.info(f"Processing {response.url}")
310
+ yield {"title": response.css("title::text").get("")}
311
+ ```
312
+
313
+ The log file directory is created automatically if it doesn't exist. Both console and file output use the same format.
docs/spiders/requests-responses.md CHANGED
@@ -157,11 +157,11 @@ yield response.follow("/dashboard", dont_filter=True, callback=self.parse_dashbo
157
 
158
  You can fine-tune what goes into the fingerprint using class attributes on your spider:
159
 
160
- | Attribute | Default | Effect |
161
- |----------------------|---------|----------------------------------------------------------------------------------------------------------------|
162
  | `fp_include_kwargs` | `False` | Include extra request kwargs (arguments you passed to the session fetch, like headers, etc.) in the fingerprint |
163
- | `fp_keep_fragments` | `False` | Keep URL fragments (`#section`) when computing fingerprints |
164
- | `fp_include_headers` | `False` | Include request headers in the fingerprint |
165
 
166
  For example, if you need to treat `https://example.com/page#section1` and `https://example.com/page#section2` as different URLs:
167
 
 
157
 
158
  You can fine-tune what goes into the fingerprint using class attributes on your spider:
159
 
160
+ | Attribute | Default | Effect |
161
+ |----------------------|---------|-----------------------------------------------------------------------------------------------------------------|
162
  | `fp_include_kwargs` | `False` | Include extra request kwargs (arguments you passed to the session fetch, like headers, etc.) in the fingerprint |
163
+ | `fp_keep_fragments` | `False` | Keep URL fragments (`#section`) when computing fingerprints |
164
+ | `fp_include_headers` | `False` | Include request headers in the fingerprint |
165
 
166
  For example, if you need to treat `https://example.com/page#section1` and `https://example.com/page#section2` as different URLs:
167
 
zensical.toml CHANGED
@@ -36,6 +36,7 @@ nav = [
36
  {"Requests & Responses" = "spiders/requests-responses.md"},
37
  {"Sessions" = "spiders/sessions.md"},
38
  {"Proxy management & Blocking" = "spiders/proxy-blocking.md"},
 
39
  ]},
40
  {"Command Line Interface" = [
41
  {Overview = "cli/overview.md"},
 
36
  {"Requests & Responses" = "spiders/requests-responses.md"},
37
  {"Sessions" = "spiders/sessions.md"},
38
  {"Proxy management & Blocking" = "spiders/proxy-blocking.md"},
39
+ {"Advanced features" = "spiders/advanced.md"}
40
  ]},
41
  {"Command Line Interface" = [
42
  {Overview = "cli/overview.md"},