Effortless Web Scraping for the Modern Web
Selection methods · Choosing a fetcher · Spiders · Proxy rotation · CLI · MCP mode
Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates elements when pages are updated. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile out of the box. And with the Spider framework, you can scale up to concurrent multi-session crawls with pause & resume and automatic proxy rotation, all in a few lines of Python. One library, no compromises.

Lightning-fast crawls with real-time stats and streaming. Built by a Web Scraper, for Web Scrapers and everyday users alike; easy for anyone to use.
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch websites under the radar!
products = p.css('.product', auto_save=True)  # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)  # If the website's structure changes later, pass `adaptive=True` to find them again!
Or scale up to a full crawl:
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
Platinum Sponsors
Scrapling handles Cloudflare Turnstile. For enterprise-grade protections, Hyper Solutions provides API endpoints that generate valid antibot tokens for Akamai, DataDome, Kasada, and Incapsula. Simple API calls, no browser automation required.
We believe proxies shouldn't be complex or expensive, so we built BirdProxies: fast residential and ISP proxies in 195+ locations, fair pricing, and genuine support. Try the FlappyBird game on the landing page and get free data!
Evomi: residential proxies from $0.49/GB, plus a scraping browser powered by fully spoofed Chromium with residential IPs, automatic CAPTCHA solving, and anti-bot bypass. Get results hassle-free with the Scraper API. Supports MCP and N8N integrations.
TikHub.io offers 900+ stable APIs across 16+ platforms including TikTok, X, YouTube, and Instagram, and holds 40M+ datasets. It also provides discounted AI models: up to 71% off Claude, GPT, GEMINI, and more.
Nsocks provides fast residential and ISP proxies for developers and scrapers: global IP coverage, high anonymity, smart rotation, and reliable performance for automation and data extraction. Simplify large-scale web crawling with Xcrawl.
Close your laptop; your scrapers keep running. PetroSky VPS: cloud servers built for nonstop automation. Windows and Linux machines, full control, from €6.99/month.
Read the in-depth review of Scrapling (November 2025) on The Web Scraping Club, the #1 newsletter dedicated to Web Scraping.
Proxy-Seller provides reliable proxy infrastructure for Web Scraping: IPv4, IPv6, ISP, residential, and mobile proxies, with stable performance, broad geographic coverage, and flexible plans for enterprise-scale data collection.
Want your ad to appear here? Click here!
Sponsors
Want your ad to appear here? Click here and pick the tier that suits you!
Key Features
Spiders - A Full Crawling Framework
- 🕷️ Scrapy-style Spider API: define Spiders with start_urls, async parse callbacks, and Request/Response objects.
- ⚡ Concurrent crawling: configurable concurrency limits, per-domain throttling, and download delays.
- 🔄 Multi-session support: a unified interface for HTTP requests and stealth headless browsers; route requests to different sessions by ID.
- 💟 Pause & Resume: checkpoint-based crawl persistence. Ctrl+C shuts down gracefully; restart, and the crawl resumes from where it left off.
- 📡 Streaming mode: receive scraped items as they arrive, with real-time stats, via async for item in spider.stream(). Ideal for UIs, pipelines, and long-running crawls (see the sketch after this list).
- 🛡️ Blocked-request detection: automatic detection and retrying of blocked requests, with customizable logic.
- 📊 Built-in export: hook in your own pipelines, or export results as built-in JSON/JSONL with result.items.to_json() / result.items.to_jsonl().
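For example, a minimal streaming run might look like the following sketch (the spider and field names are illustrative; stream() is the API described in the list above):

import asyncio
from scrapling.spiders import Spider, Response

class StreamDemo(Spider):
    name = "stream-demo"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {"text": quote.css('.text::text').get()}

async def main():
    # Items arrive one by one as they are scraped, instead of all at once at the end
    async for item in StreamDemo().stream():
        print(item)

asyncio.run(main())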
Advanced Website Fetching with Session Support
- HTTP requests: fast, stealthy HTTP requests with the Fetcher class. Can impersonate browsers' TLS fingerprints and headers, and supports HTTP/3.
- Dynamic loading: fetch dynamic websites with full browser automation through the DynamicFetcher class, which supports Playwright's Chromium and Google Chrome.
- Anti-bot bypass: advanced stealth capabilities through StealthyFetcher and fingerprint spoofing. Easily and automatically bypass all types of Cloudflare's Turnstile/Interstitial.
- Session management: persistent session support through the FetcherSession, StealthySession, and DynamicSession classes for managing cookies and state across requests.
- Proxy rotation: a built-in ProxyRotator with round-robin or custom strategies for all session types, plus per-request proxy overrides.
- Domain blocking: block requests to specific domains (and their subdomains) in the browser-based fetchers.
- Async support: full async support across all fetchers and the dedicated async session classes (see the sketch after this list).
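As a quick illustration of the async support mentioned in the last point, here is a minimal sketch using AsyncFetcher (assuming it mirrors Fetcher's one-off interface shown earlier, with awaitable methods):

import asyncio
from scrapling.fetchers import AsyncFetcher

async def main():
    # Same one-off interface as Fetcher.get, but awaitable
    page = await AsyncFetcher.get('https://quotes.toscrape.com/')
    print(page.css('.quote .text::text').getall()[:3])

asyncio.run(main())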
Adaptive Scraping & AI Integration
- 🔄 Smart element tracking: relocate elements after website changes using intelligent similarity algorithms.
- 🎯 Smart flexible selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 Find similar elements: automatically locate elements similar to one you have found (see the sketch after this list).
- 🤖 MCP server to use with AI: a built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server has powerful custom capabilities that use Scrapling to extract the target content before passing it to the AI (Claude, Cursor, etc.), speeding up operations and cutting costs by minimizing token usage. (Demo video)
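A small sketch of the similarity features, using the find_by_text and find_similar APIs that appear in the parsing examples later on this page (the search text is illustrative, and the exact matching behavior of find_by_text is described in the docs):

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
# Find one element by its text content, then locate elements with a similar structure
first = page.find_by_text('The world as we have created it', tag='span')
similar = first.find_similar()  # e.g., the other quote-text elements on the page
print(len(similar))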
髿§èœã§å®æŠãã¹ãæžã¿ã®ã¢ãŒããã¯ãã£
- 🚀 Lightning fast: optimized performance that outpaces most Python scraping libraries.
- 🔋 Memory efficient: optimized data structures and lazy loading for a minimal memory footprint.
- ⚡ Fast JSON serialization: 10x faster than the standard library.
- 🏗️ Battle tested: Scrapling not only has 92% test coverage and full type coverage, it has also been used daily by hundreds of Web Scrapers over the past year.
A Developer/Web-Scraper-Friendly Experience
- 🎯 Interactive Web Scraping shell: an optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up Web Scraping script development, such as converting curl requests into Scrapling requests and viewing request results in your browser.
- 🚀 Use it directly from the terminal: optionally, scrape a URL with Scrapling without writing a single line of code!
- 🛠️ Rich navigation API: advanced DOM traversal with parent, sibling, and child navigation methods.
- 🧬 Enhanced text processing: built-in regex, cleaning methods, and optimized string operations.
- 📝 Auto selector generation: generate robust CSS/XPath selectors for any element (see the sketch after this list).
- 🔌 Familiar API: a design similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
- 📘 Complete type coverage: full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with PyRight and MyPy on every change.
- 🐋 Ready-to-use Docker image: a Docker image with all browsers included is automatically built and pushed with every release.
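For instance, combining selector generation with DOM navigation might look like this sketch (the generate_css_selector property name is an assumption based on the feature description above; verify it against the documentation):

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
author = page.css('.author')[0]
print(author.generate_css_selector)  # a robust CSS selector for this element (property name assumed)
print(author.parent)                 # DOM navigation, as shown in the parsing examples below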
Getting Started
Before digging deeper, let's take a quick look at what Scrapling can do.
Basic Usage
HTTP requests with session support
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use the latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
Advanced stealth mode
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you are done
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use the one-off request style: it opens the browser for this request, then closes it when finished
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
Full browser automation
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you are done
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selectors, if you prefer them

# Or use the one-off request style: it opens the browser for this request, then closes it when finished
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
Spider
䞊è¡ãªã¯ãšã¹ããè€æ°ã® Session ã¿ã€ããPause & Resume ãåããæ¬æ Œçãªã¯ããŒã©ãŒãæ§ç¯ïŒ
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
Use multiple session types within a single Spider:
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # explicit callback
Pause & resume long crawls using checkpoints:
QuotesSpider(crawldir="./crawl_data").start()
Press Ctrl+C to pause gracefully; progress is saved automatically. Start the Spider again later with the same crawldir, and it resumes from where it left off.
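Put together, the whole pause-and-resume workflow is just two identical calls (a sketch reusing the QuotesSpider defined above):

# Run 1: crawls and writes checkpoints to ./crawl_data; press Ctrl+C to pause
QuotesSpider(crawldir="./crawl_data").start()

# Run 2, later: the same call with the same crawldir resumes from the last checkpoint
QuotesSpider(crawldir="./crawl_data").start()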
Advanced Parsing & Navigation
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...

# Find elements by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
You can also use the parser directly, without fetching websites:
from scrapling.parser import Selector
page = Selector("<html>...</html>")
It works exactly the same way!
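A tiny self-contained example, using only the selection APIs shown above:

from scrapling.parser import Selector

page = Selector('<html><body><div class="quote"><span class="text">Hello</span></div></body></html>')
print(page.css('.quote .text::text').get())  # prints "Hello"

Since the parser is included in the base install, this works without any of the optional fetcher dependencies.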
Async session management example
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Using an async session
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']
    for url in urls:
        task = session.fetch(url)
        tasks.append(task)
    print(session.get_pool_stats())  # optional - status of the browser tab pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
CLI and Interactive Shell
Scrapling includes a powerful command-line interface:
Launch the interactive Web Scraping shell
scrapling shell
Extract a page to a file directly, without programming (by default, the content of the body tag is extracted). If the output file ends with .txt, the target's text content is extracted; if it ends with .md, you get a Markdown representation of the HTML content; if it ends with .html, you get the HTML content as-is.
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
There are many more features, such as the MCP server and the interactive Web Scraping shell, but we want to keep this page concise. Check out the full documentation here.
Performance Benchmarks
Scrapling isn't just powerful; it's also blazing fast. The benchmarks below compare Scrapling's parser with the latest versions of other popular libraries.
Text extraction speed test (5,000 nested elements)
| # | Library | Time (ms) | vs Scrapling |
|---|---|---|---|
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01x |
| 3 | Raw Lxml | 2.54 | 1.257x |
| 4 | PyQuery | 24.17 | ~12x |
| 5 | Selectolax | 82.63 | ~41x |
| 6 | MechanicalSoup | 1549.71 | ~767.1x |
| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
Element similarity and text search performance
Scrapling's adaptive element finding capabilities significantly outperform the alternatives:
| Library | Time (ms) | vs Scrapling |
|---|---|---|
| Scrapling | 2.39 | 1.0x |
| AutoScraper | 12.45 | 5.209x |
All benchmarks represent the average of 100+ runs. See benchmarks.py for the methodology.
Installation
Scrapling requires Python 3.10 or higher:
pip install scrapling
This installation includes only the parser engine and its dependencies; no fetcher or command-line dependencies are included.
Optional dependencies
If you are going to use any of the extra features below, any fetcher, or any of their classes, you need to install the fetcher and browser dependencies as follows:
pip install "scrapling[fetchers]"
scrapling install          # normal install
scrapling install --force  # force reinstall

This downloads all browsers, plus their system dependencies and the fingerprint-manipulation dependencies.
Or, instead of running commands, you can install them from code:
from scrapling.cli import install

install([], standalone_mode=False)           # normal install
install(["--force"], standalone_mode=False)  # force reinstall

Extras:
- Install the MCP server feature: pip install "scrapling[ai]"
- Install the shell features (the Web Scraping shell and the extract command): pip install "scrapling[shell]"
- Install everything: pip install "scrapling[all]"

After any of these extras, don't forget that you still need to install the browser dependencies with scrapling install (if you haven't already).
Docker
You can also pull a Docker image with all the extras and browsers from Docker Hub:
docker pull pyd4vinci/scrapling
ãŸã㯠GitHub ã¬ãžã¹ããªããããŠã³ããŒãïŒ
docker pull ghcr.io/d4vinci/scrapling:latest
This image is built and pushed automatically using GitHub Actions from the repository's main branch.
Contributing
Contributions are welcome! Please read the contribution guidelines before getting started.
Disclaimer
This library is provided for educational and research purposes only. By using it, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files.
📃 Citation
If you use this library in your research, please cite it as follows:
@misc{scrapling,
  author = {Karim Shoair},
  title = {Scrapling},
  year = {2024},
  url = {https://github.com/D4Vinci/Scrapling},
  note = {An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!}
}
License
This work is licensed under the BSD-3-Clause License.
Acknowledgments
This project includes code adapted from:
- Parsel (BSD License): used in the translator submodule