<!-- mcp-name: io.github.D4Vinci/Scrapling -->
<h1 align="center">
<a href="https://scrapling.readthedocs.io">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
<img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
</picture>
</a>
<br>
<small>Effortless Web Scraping for the Modern Web</small>
</h1>
<p align="center">
<a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
<img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
<a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
<img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
<a href="https://clickpy.clickhouse.com/dashboard/scrapling" rel="nofollow"><img src="https://img.shields.io/pypi/dm/scrapling" alt="PyPI package downloads"></a>
<a href="https://github.com/D4Vinci/Scrapling/tree/main/agent-skill" alt="AI Agent Skill directory">
<img alt="Static Badge" src="https://img.shields.io/badge/Skill-black?style=flat&label=Agent&link=https%3A%2F%2Fgithub.com%2FD4Vinci%2FScrapling%2Ftree%2Fmain%2Fagent-skill"></a>
<a href="https://clawhub.ai/D4Vinci/scrapling-official" alt="OpenClaw Skill">
<img alt="OpenClaw Skill" src="https://img.shields.io/badge/Clawhub-darkred?style=flat&label=OpenClaw&link=https%3A%2F%2Fclawhub.ai%2FD4Vinci%2Fscrapling-official"></a>
<br/>
<a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
<img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
</a>
<a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
<img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
</a>
<br/>
<a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
<img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
</p>
<p align="center">
<a href="https://scrapling.readthedocs.io/en/latest/parsing/selection.html"><strong>Selection Methods</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing.html"><strong>Choosing a Fetcher</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy Rotation</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/cli/overview.html"><strong>CLI</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html"><strong>MCP Mode</strong></a>
</p>
Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.
Its parser learns from website changes and automatically relocates elements when pages are updated. Its fetchers work out of the box and bypass anti-bot systems such as Cloudflare Turnstile. And its Spider framework scales up to concurrent, multi-session crawls with pause & resume and automatic proxy rotation, all in a few lines of Python, with one library and no compromises.
Blazing-fast crawls with real-time stats and streaming. Built by Web Scrapers, for Web Scrapers and everyday users alike, and easy for everyone.
```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website under the radar!
products = p.css('.product', auto_save=True)  # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)  # Later, if the website structure changed, pass `adaptive=True` and Scrapling still finds it!
```
Or scale up to a full crawl:
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
```
<p align="center">
<a href="https://dataimpulse.com/?utm_source=scrapling&utm_medium=banner&utm_campaign=scrapling" target="_blank" style="display:flex; justify-content:center; padding:4px 0;">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/DataImpulse.png" alt="At DataImpulse, we specialize in developing custom proxy services for your business. Make requests from anywhere, collect data, and enjoy fast connections with our premium proxies." style="max-height:60px;">
</a>
</p>
# Platinum Sponsors
<table>
<tr>
<td width="200">
<a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png">
</a>
</td>
<td> Scrapling handles Cloudflare Turnstile; for enterprise-grade protections, <a href="https://hypersolutions.co?utm_source=github&utm_medium=readme&utm_campaign=scrapling">
<b>Hyper Solutions</b>
</a> provides API endpoints that generate valid antibot tokens for <b>Akamai</b>, <b>DataDome</b>, <b>Kasada</b>, and <b>Incapsula</b>. Simple API calls, no browser automation needed. </td>
</tr>
<tr>
<td width="200">
<a href="https://birdproxies.com/t/scrapling" target="_blank" title="At Bird Proxies, we eliminate your pains such as banned IPs, geo restriction, and high costs so you can focus on your work.">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/BirdProxies.jpg">
</a>
</td>
<td>We believe proxies shouldn't be complicated or expensive, so we built <a href="https://birdproxies.com/t/scrapling">
<b>BirdProxies</b>
</a>. <br /> Fast residential and ISP proxies in 195+ locations, fair pricing, and real support. <br />
<b>Try the FlappyBird game on our landing page and get free data!</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png">
</a>
</td>
<td>
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling">
<b>Evomi</b>
</a>: residential proxies from $0.49/GB. Scraping browsers powered by fully disguised Chromium, with residential IPs, automatic CAPTCHA solving, and anti-bot bypass.<br />
<b>Get instant results with the Scraper API, with MCP and N8N integrations.</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank" title="Unlock the Power of Social Media Data & AI">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TikHub.jpg">
</a>
</td>
<td>
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank">TikHub.io</a> offers 900+ stable APIs across 16+ platforms, including TikTok, X, YouTube, and Instagram, with 40M+ datasets.<br /> It also offers <a href="https://ai.tikhub.io/?ref=KarimShoair" target="_blank">discounted AI models</a>: Claude, GPT, Gemini, and more at up to 71% off.
</td>
</tr>
<tr>
<td width="200">
<a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank" title="Scalable Web Data Access for AI Applications">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/nsocks.png">
</a>
</td>
<td>
<a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank">Nsocks</a> provides fast residential and ISP proxies for developers and scrapers, with global IP coverage, high anonymity, and smart rotation. Reliable performance for automation and data extraction. Simplify large-scale web crawling with <a href="https://www.xcrawl.com/?keyword=2p67aivg" target="_blank">Xcrawl</a>.
</td>
</tr>
<tr>
<td width="200">
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting.">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png">
</a>
</td>
<td>
Your scrapers keep running even when your laptop is closed.<br />
<a href="https://petrosky.io/d4vinci" target="_blank">PetroSky VPS</a> - cloud servers built for bots and automation. Windows and Linux machines, full control, from €6.99/month.
</td>
</tr>
<tr>
<td width="200">
<a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank" title="The #1 newsletter dedicated to Web Scraping">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TWSC.png">
</a>
</td>
<td>
Read the <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank">in-depth Scrapling review on The Web Scraping Club</a> (November 2025), the #1 newsletter dedicated to Web Scraping.
</td>
</tr>
<tr>
<td width="200">
<a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank" title="Proxy-Seller provides reliable proxy infrastructure for Web Scraping">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxySeller.png">
</a>
</td>
<td>
<a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank">Proxy-Seller</a> provides reliable proxy infrastructure for Web Scraping. It supports IPv4, IPv6, ISP, residential, and mobile proxies, with stable performance, broad geographic coverage, and flexible plans for enterprise-scale data collection.
</td>
</tr>
</table>
<i><sub>Want your ad here? Click [here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646)</sub></i>
# Sponsors
<!-- sponsors -->
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
<a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
<a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
<a href="https://www.webshare.io/?referral_code=48r2m2cd5uz1" target="_blank" title="The Most Reliable Proxy with Unparalleled Performance"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/webshare.png"></a>
<a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
<!-- /sponsors -->
<i><sub>Want your ad here? Click [here](https://github.com/sponsors/D4Vinci) and choose the tier that suits you!</sub></i>
---
## Main Features
### Spider: A Full Crawling Framework
- 🕷️ **Scrapy-style Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
- ⚡ **Concurrent crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
- 🔀 **Multi-session support**: A unified interface for HTTP-request, stealth, and headless-browser sessions; route requests to different sessions by ID.
- 💾 **Pause & resume**: Checkpoint-based crawl persistence with graceful shutdown on Ctrl+C; restart to resume exactly where you left off.
- 📡 **Streaming mode**: Receive scraped items as they arrive, with real-time stats, via `async for item in spider.stream()`; ideal for UIs, pipelines, and long-running crawls.
- 🛡️ **Blocked-request detection**: Automatic detection and retrying of blocked requests, with customizable logic.
- 📦 **Built-in export**: Hook into your own pipelines, or export results as built-in JSON/JSONL with `result.items.to_json()` / `result.items.to_jsonl()`.
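For intuition, the streaming pattern above can be sketched with a plain async generator. This is an illustration of the concept only, not Scrapling's `stream()` implementation; the `stream_items` function and its fake fetch are hypothetical stand-ins:

```python
import asyncio

# Toy sketch of streaming: the crawler yields each item as soon as it is
# scraped, instead of collecting everything before returning.
async def stream_items(urls):
    for url in urls:
        await asyncio.sleep(0)  # stand-in for a real asynchronous fetch
        yield {"url": url, "title": f"Title of {url}"}

async def main():
    items = []
    async for item in stream_items(["https://example.com/a", "https://example.com/b"]):
        items.append(item)  # each item is available immediately - ideal for pipelines/UIs
    return items

items = asyncio.run(main())
```

The consumer can start processing, exporting, or displaying results while the crawl is still running.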
### Advanced Website Fetching with Session Support
- **HTTP requests**: Fast and stealthy HTTP requests with the `Fetcher` class. It can mimic browsers' TLS fingerprints and headers, and use HTTP/3.
- **Dynamic loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, which supports Playwright's Chromium and Google Chrome.
- **Anti-bot bypass**: Advanced stealth capabilities through `StealthyFetcher` and fingerprint spoofing. Easily bypass all types of Cloudflare's Turnstile/Interstitial automatically.
- **Session management**: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for managing cookies and state across requests.
- **Proxy rotation**: A built-in `ProxyRotator` with round-robin or custom strategies for all session types, plus per-request proxy overrides.
- **Domain blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
- **Async support**: Complete async support across all fetchers and the dedicated async session classes.
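The round-robin strategy behind proxy rotation can be sketched in a few lines of plain Python. This is an illustrative sketch of the idea, not Scrapling's actual `ProxyRotator` implementation, and the class and proxy URLs below are hypothetical:

```python
from itertools import cycle

class RoundRobinRotator:
    """Minimal round-robin rotation: each call hands out the next proxy in order."""
    def __init__(self, proxies):
        self._pool = cycle(proxies)  # endless iterator over the proxy list

    def next_proxy(self):
        return next(self._pool)

rotator = RoundRobinRotator(["http://p1:8080", "http://p2:8080", "http://p3:8080"])
picks = [rotator.next_proxy() for _ in range(4)]
# The fourth pick wraps around to the first proxy again
```

Attaching one proxy per outgoing request like this spreads the traffic evenly across the pool, which is the baseline strategy most rotators extend with health checks or weighting.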
### Adaptive Scraping & AI Integration
- 🔄 **Smart element tracking**: Relocate elements after website changes using intelligent similarity algorithms.
- 🎯 **Smart flexible selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 **Similar-element detection**: Automatically find elements similar to one you have already found.
- 🤖 **MCP server for AI use**: A built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server has powerful custom capabilities that use Scrapling to extract the target content before passing it to the AI (Claude/Cursor, etc.), speeding up operations and cutting costs by minimizing token usage. ([Demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
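The idea behind relocating an element by similarity can be illustrated with a simple attribute-overlap score. This toy sketch is not Scrapling's actual algorithm; the Jaccard index and the candidate records are illustrative assumptions:

```python
# Toy illustration of similarity-based element matching: after a page's
# structure changes, score candidates by how many attributes they share
# with the previously saved element and keep the closest match.
def jaccard(a, b):
    """Jaccard index: |intersection| / |union| of two attribute sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def most_similar(target_attrs, candidates):
    return max(candidates, key=lambda c: jaccard(target_attrs, c["attrs"]))

saved_attrs = ["class=product", "data-id=42"]  # what we stored before the redesign
candidates = [
    {"name": "ad",   "attrs": ["class=banner"]},
    {"name": "item", "attrs": ["class=product", "data-id=42", "class=new"]},
]
best = most_similar(saved_attrs, candidates)
# best["name"] == "item"
```

A production tracker would also weigh text content, tag names, and tree position, but the core step is the same: rank candidates by similarity to the saved fingerprint.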
### 髿§èœã§å®æŠãã¹ãæžã¿ã®ã¢ãŒããã¯ãã£
- ð **è¶
é«é**ïŒã»ãšãã©ã® Python ã¹ã¯ã¬ã€ãã³ã°ã©ã€ãã©ãªãäžåãæé©åãããããã©ãŒãã³ã¹ã
- ð **ã¡ã¢ãªå¹ç**ïŒæå°ã®ã¡ã¢ãªãããããªã³ãã®ããã®æé©åãããããŒã¿æ§é ãšé
å»¶èªã¿èŸŒã¿ã
- â¡ **é«é JSON ã·ãªã¢ã«å**ïŒæšæºã©ã€ãã©ãªã® 10 åã®é床ã
- ðïž **宿Šãã¹ãæžã¿**ïŒScrapling 㯠92% ã®ãã¹ãã«ãã¬ããžãšå®å
šãªåãã³ãã«ãã¬ããžãåããŠããã ãã§ãªããéå»1幎éã«æ°çŸäººã® Web Scraper ã«ãã£ãŠæ¯æ¥äœ¿çšãããŠããŸããã
### A Developer/Web-Scraper-Friendly Experience
- 🎯 **Interactive Web Scraping shell**: An optional built-in IPython shell with Scrapling integration, shortcuts, and new tools, such as converting curl requests to Scrapling requests and viewing request results in your browser, to speed up the development of Web Scraping scripts.
- 🚀 **Use it straight from the terminal**: Optionally, scrape a URL with Scrapling without writing a single line of code!
- 🛠️ **Rich navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
- 🧬 **Enhanced text processing**: Built-in regex and cleaning methods, and optimized string operations.
- 📝 **Auto selector generation**: Generate robust CSS/XPath selectors for any element.
- 🔌 **Familiar API**: A design similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
- 📘 **Complete type coverage**: Full type hints for excellent IDE support and code completion; the whole codebase is automatically scanned with **PyRight** and **MyPy** on every change.
- 🐳 **Ready-to-use Docker image**: With every release, a Docker image that includes all browsers is automatically built and pushed.
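To give a feel for what selector generation involves, here is a deliberately tiny sketch of building a CSS path from an element's ancestry. It is a toy illustration under simplified assumptions (tag + classes only), not Scrapling's generator; `css_step`, `build_selector`, and the input format are hypothetical:

```python
# Toy sketch: turn an ancestor chain (root -> target) into a CSS path.
def css_step(tag, classes):
    """One selector step, e.g. ('div', ['quote']) -> 'div.quote'."""
    return tag + "".join(f".{c}" for c in classes)

def build_selector(ancestry):
    """ancestry: list of (tag, [classes]) pairs from root to target."""
    return " > ".join(css_step(tag, classes) for tag, classes in ancestry)

sel = build_selector([("div", ["quote"]), ("span", ["text"])])
# sel == "div.quote > span.text"
```

Real generators additionally prefer stable anchors (ids, unique attributes) over positional steps so the selector survives minor layout changes.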
## Getting Started
Before digging deeper, let's take a quick look at what Scrapling can do.
### Basic Usage
HTTP requests with session support:
```python
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use the latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
```
Advanced stealth mode:
```python
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use one-off request style; it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
```
Full browser automation:
```python
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selectors, if you prefer them

# Or use one-off request style; it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
```
### Spiders
Build full crawlers with concurrent requests, multiple session types, and pause & resume:
```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
Use multiple session types in a single spider:
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # Explicit callback
```
Pause and resume long crawls with checkpoints:
```python
QuotesSpider(crawldir="./crawl_data").start()
```
Press Ctrl+C to pause gracefully; progress is saved automatically. Pass the same `crawldir` the next time you start the spider, and it resumes exactly where it left off.
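Conceptually, checkpoint-based resumption boils down to persisting the pending frontier and the set of completed URLs, then reloading them on startup. The sketch below illustrates that idea only; it is not Scrapling's on-disk checkpoint format, and the function names and JSON layout are assumptions:

```python
import json
import os
import tempfile

# Illustrative checkpoint logic: persist which URLs are done and which are
# still pending, so a restarted crawl skips completed work.
def save_checkpoint(path, pending, done):
    with open(path, "w") as f:
        json.dump({"pending": sorted(pending), "done": sorted(done)}, f)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {"pending": [], "done": []}  # fresh crawl
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
save_checkpoint(path,
                pending={"https://example.com/page2"},
                done={"https://example.com/"})
state = load_checkpoint(path)  # on restart, only page2 still needs fetching
```

Writing the checkpoint on every graceful shutdown (e.g. a Ctrl+C handler) is what makes "resume exactly where it left off" possible.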
### Advanced Parsing & Navigation
```python
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selectors
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...

# Find elements by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```
You can also use the parser directly without fetching websites:
```python
from scrapling.parser import Selector
page = Selector("<html>...</html>")
```
and it works exactly the same way!
### Async Session Management Example
```python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync/async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Using the async sessions
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']
    for url in urls:
        task = session.fetch(url)
        tasks.append(task)
    print(session.get_pool_stats())  # Optional - status of the browser tab pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
```
## CLI & Interactive Shell
Scrapling includes a powerful command-line interface:
[](https://asciinema.org/a/736339)
Launch the interactive Web Scraping shell:
```bash
scrapling shell
```
Extract a page's content straight to a file without programming (by default, the content of the `body` tag is extracted). If the output file ends with `.txt`, the target's text content is extracted; if it ends with `.md`, you get a Markdown representation of the HTML content; and if it ends with `.html`, you get the HTML content as it is.
```bash
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
```
> [!NOTE]
> There are many more features, such as the MCP server and the interactive Web Scraping shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
## Performance Benchmarks
Scrapling isn't just powerful; it's also blazing fast. The benchmarks below compare Scrapling's parser with the latest versions of other popular libraries.
### Text Extraction Speed Test (5,000 nested elements)
| # | Library | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01x |
| 3 | Raw Lxml | 2.54 | 1.257x |
| 4 | PyQuery | 24.17 | ~12x |
| 5 | Selectolax | 82.63 | ~41x |
| 6 | MechanicalSoup | 1549.71 | ~767.1x |
| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
### Element Similarity & Text Search Performance
Scrapling's adaptive element-finding capabilities significantly outperform the alternatives:
| Library | Time (ms) | vs Scrapling |
|-------------|:---------:|:------------:|
| Scrapling | 2.39 | 1.0x |
| AutoScraper | 12.45 | 5.209x |
> All benchmarks represent the average of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for the methodology.
## Installation
Scrapling requires Python 3.10 or higher:
```bash
pip install scrapling
```
This installation includes only the parser engine and its dependencies, without any fetcher or command-line dependencies.
### Optional Dependencies
1. If you are going to use any of the extras below, the fetchers, or their classes, you need to install the fetcher dependencies and then the browser dependencies, as follows:
```bash
pip install "scrapling[fetchers]"
scrapling install # normal install
scrapling install --force # force reinstall
```
This downloads all browsers along with their system dependencies and fingerprint-manipulation dependencies.
Or, instead of running the command, you can install them from code:
```python
from scrapling.cli import install
install([], standalone_mode=False) # normal install
install(["--force"], standalone_mode=False) # force reinstall
```
2. Extras:
- Install the MCP server feature:
```bash
pip install "scrapling[ai]"
```
- Install the shell features (the Web Scraping shell and the `extract` command):
```bash
pip install "scrapling[shell]"
```
- Install everything:
```bash
pip install "scrapling[all]"
```
Don't forget that after installing any of these extras, you still need to install the browser dependencies with `scrapling install` (if you haven't already)
### Docker
You can also install a Docker image with all the extras and browsers from DockerHub:
```bash
docker pull pyd4vinci/scrapling
```
ãŸã㯠GitHub ã¬ãžã¹ããªããããŠã³ããŒãïŒ
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
This image is automatically built and pushed through GitHub Actions, using the repository's main branch.
## Contributing
Contributions are welcome! Please read the [contribution guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.
## Disclaimer
> [!CAUTION]
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data-scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files.
## 📝 Citation
If you use this library in your research, please cite it as follows:
```text
@misc{scrapling,
author = {Karim Shoair},
title = {Scrapling},
year = {2024},
url = {https://github.com/D4Vinci/Scrapling},
note = {An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!}
}
```
## License
This work is licensed under the BSD-3-Clause License.
## Acknowledgments
This project includes code adapted from:
- Parsel (BSD License) - used in the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule
---
<div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>