
Scrapling Poster
Effortless Web Scraping for the Modern Web


์„ ํƒ ๋ฉ”์„œ๋“œ ยท Fetcher ์„ ํƒ ๊ฐ€์ด๋“œ ยท Spider ยท ํ”„๋ก์‹œ ๋กœํ…Œ์ด์…˜ ยท CLI ยท MCP ์„œ๋ฒ„

Scrapling์€ ๋‹จ์ผ ์š”์ฒญ๋ถ€ํ„ฐ ๋Œ€๊ทœ๋ชจ ํฌ๋กค๋ง๊นŒ์ง€ ๋ชจ๋“  ๊ฒƒ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ์ ์‘ํ˜• Web Scraping ํ”„๋ ˆ์ž„์›Œํฌ์ž…๋‹ˆ๋‹ค.

ํŒŒ์„œ๋Š” ์›น์‚ฌ์ดํŠธ ๋ณ€๊ฒฝ ์‚ฌํ•ญ์„ ํ•™์Šตํ•˜๊ณ , ํŽ˜์ด์ง€๊ฐ€ ์—…๋ฐ์ดํŠธ๋˜๋ฉด ์š”์†Œ๋ฅผ ์ž๋™์œผ๋กœ ์žฌ๋ฐฐ์น˜ํ•ฉ๋‹ˆ๋‹ค. Fetcher๋Š” Cloudflare Turnstile ๊ฐ™์€ ์•ˆํ‹ฐ๋ด‡ ์‹œ์Šคํ…œ์„ ๋ณ„๋„ ์„ค์ • ์—†์ด ์šฐํšŒํ•ฉ๋‹ˆ๋‹ค. Spider ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ผ์‹œ์ •์ง€/์žฌ๊ฐœ ๋ฐ ์ž๋™ ํ”„๋ก์‹œ ๋กœํ…Œ์ด์…˜์„ ๊ฐ–์ถ˜ ๋™์‹œ ๋ฉ€ํ‹ฐ ์„ธ์…˜ ํฌ๋กค๋ง์œผ๋กœ ํ™•์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค โ€” ๋ชจ๋‘ Python ๋ช‡ ์ค„์ด๋ฉด ๋ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜์˜ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ, ํƒ€ํ˜‘ ์—†๋Š” ์„ฑ๋Šฅ.

์‹ค์‹œ๊ฐ„ ํ†ต๊ณ„์™€ ์ŠคํŠธ๋ฆฌ๋ฐ์„ ํ†ตํ•œ ์ดˆ๊ณ ์† ํฌ๋กค๋ง. Web Scraper๊ฐ€ ๋งŒ๋“ค๊ณ , Web Scraper์™€ ์ผ๋ฐ˜ ์‚ฌ์šฉ์ž ๋ชจ๋‘๋ฅผ ์œ„ํ•ด ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website while bypassing detection!
products = p.css('.product', auto_save=True)                                        # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)                                         # If the website structure changes later, pass `adaptive=True` to relocate them!

๋˜๋Š” ๋ณธ๊ฒฉ์ ์ธ ํฌ๋กค๋ง์œผ๋กœ ํ™•์žฅ

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()

At DataImpulse, we specialize in developing custom proxy services for your business. Make requests from anywhere, collect data, and enjoy fast connections with our premium proxies.

ํ”Œ๋ž˜ํ‹ฐ๋„˜ ์Šคํฐ์„œ

Scrapling์€ Cloudflare Turnstile์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์—”ํ„ฐํ”„๋ผ์ด์ฆˆ๊ธ‰ ๋ณดํ˜ธ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋ฉด, Hyper Solutions ๊ฐ€ Akamai, DataDome, Kasada, Incapsula์šฉ ์œ ํšจํ•œ ์•ˆํ‹ฐ๋ด‡ ํ† ํฐ์„ ์ƒ์„ฑํ•˜๋Š” API ์—”๋“œํฌ์ธํŠธ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฐ„๋‹จํ•œ API ํ˜ธ์ถœ๋งŒ์œผ๋กœ, ๋ธŒ๋ผ์šฐ์ € ์ž๋™ํ™”๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค.
ํ”„๋ก์‹œ๋Š” ๋ณต์žกํ•˜๊ฑฐ๋‚˜ ๋น„์Œ€ ์ด์œ ๊ฐ€ ์—†๋‹ค๋Š” ์ƒ๊ฐ์œผ๋กœ BirdProxies ๋ฅผ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.
195๊ฐœ ์ด์ƒ ์ง€์—ญ์˜ ๋น ๋ฅธ ๋ ˆ์ง€๋ด์…œ ๋ฐ ISP ํ”„๋ก์‹œ, ํ•ฉ๋ฆฌ์ ์ธ ๊ฐ€๊ฒฉ, ์‹ค์งˆ์ ์ธ ์ง€์›.
๋žœ๋”ฉ ํŽ˜์ด์ง€์—์„œ FlappyBird ๊ฒŒ์ž„์„ ํ”Œ๋ ˆ์ดํ•˜๊ณ  ๋ฌด๋ฃŒ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์œผ์„ธ์š”!
Evomi : ๋ ˆ์ง€๋ด์…œ ํ”„๋ก์‹œ GB๋‹น $0.49๋ถ€ํ„ฐ. ์™„์ „ํžˆ ์œ„์žฅ๋œ Chromium ์Šคํฌ๋ ˆ์ดํ•‘ ๋ธŒ๋ผ์šฐ์ €, ๋ ˆ์ง€๋ด์…œ IP, ์ž๋™ CAPTCHA ํ•ด๊ฒฐ, ์•ˆํ‹ฐ๋ด‡ ์šฐํšŒ.
Scraper API๋กœ ๋ฒˆ๊ฑฐ๋กœ์›€ ์—†์ด ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ์„ธ์š”. MCP ๋ฐ N8N ํ†ตํ•ฉ ์ง€์›.
TikHub.io๋Š” TikTok, X, YouTube, Instagram ๋“ฑ 16๊ฐœ ์ด์ƒ ํ”Œ๋žซํผ์—์„œ 900๊ฐœ ์ด์ƒ์˜ ์•ˆ์ •์ ์ธ API๋ฅผ ์ œ๊ณตํ•˜๋ฉฐ, 4,000๋งŒ ์ด์ƒ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ๋ณด์œ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
ํ• ์ธ๋œ AI ๋ชจ๋ธ๋„ ์ œ๊ณต โ€” Claude, GPT, GEMINI ๋“ฑ ์ตœ๋Œ€ 71% ํ• ์ธ.
Nsocks๋Š” ๊ฐœ๋ฐœ์ž์™€ ์Šคํฌ๋ ˆ์ดํผ๋ฅผ ์œ„ํ•œ ๋น ๋ฅธ ๋ ˆ์ง€๋ด์…œ ๋ฐ ISP ํ”„๋ก์‹œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ธ€๋กœ๋ฒŒ IP ์ปค๋ฒ„๋ฆฌ์ง€, ๋†’์€ ์ต๋ช…์„ฑ, ์Šค๋งˆํŠธ ๋กœํ…Œ์ด์…˜, ์ž๋™ํ™”์™€ ๋ฐ์ดํ„ฐ ์ถ”์ถœ์„ ์œ„ํ•œ ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ. Xcrawl๋กœ ๋Œ€๊ทœ๋ชจ ์›น ํฌ๋กค๋ง์„ ๊ฐ„์†Œํ™”ํ•˜์„ธ์š”.
๋…ธํŠธ๋ถ์„ ๋‹ซ์œผ์„ธ์š”. ์Šคํฌ๋ž˜ํผ๋Š” ๊ณ„์† ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.
PetroSky VPS - ๋…ผ์Šคํ†ฑ ์ž๋™ํ™”๋ฅผ ์œ„ํ•œ ํด๋ผ์šฐ๋“œ ์„œ๋ฒ„. Windows ๋ฐ Linux ๋จธ์‹ ์„ ์™„๋ฒฝํ•˜๊ฒŒ ์ œ์–ด. ์›” โ‚ฌ6.99๋ถ€ํ„ฐ.
The Web Scraping Club์—์„œ Scrapling์˜ ์ „์ฒด ๋ฆฌ๋ทฐ(2025๋…„ 11์›”)๋ฅผ ์ฝ์–ด๋ณด์„ธ์š”. ์›น ์Šคํฌ๋ž˜ํ•‘ ์ „๋ฌธ No.1 ๋‰ด์Šค๋ ˆํ„ฐ์ž…๋‹ˆ๋‹ค.
Proxy-Seller๋Š” ์›น ์Šคํฌ๋ž˜ํ•‘์„ ์œ„ํ•œ ์•ˆ์ •์ ์ธ ํ”„๋ก์‹œ ์ธํ”„๋ผ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. IPv4, IPv6, ISP, ์ฃผ๊ฑฐ์šฉ ๋ฐ ๋ชจ๋ฐ”์ผ ํ”„๋ก์‹œ๋ฅผ ์ง€์›ํ•˜๋ฉฐ, ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ, ๊ด‘๋ฒ”์œ„ํ•œ ์ง€์—ญ ์ปค๋ฒ„๋ฆฌ์ง€, ๊ธฐ์—… ๊ทœ๋ชจ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘์„ ์œ„ํ•œ ์œ ์—ฐํ•œ ์š”๊ธˆ์ œ๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์— ๊ด‘๊ณ ๋ฅผ ๊ฒŒ์žฌํ•˜๊ณ  ์‹ถ์œผ์‹ ๊ฐ€์š”? ์—ฌ๊ธฐ๋ฅผ ํด๋ฆญํ•˜์„ธ์š”

์Šคํฐ์„œ

์—ฌ๊ธฐ์— ๊ด‘๊ณ ๋ฅผ ๊ฒŒ์žฌํ•˜๊ณ  ์‹ถ์œผ์‹ ๊ฐ€์š”? ์—ฌ๊ธฐ๋ฅผ ํด๋ฆญํ•˜๊ณ  ์›ํ•˜๋Š” ํ‹ฐ์–ด๋ฅผ ์„ ํƒํ•˜์„ธ์š”!


์ฃผ์š” ๊ธฐ๋Šฅ

Spider โ€” ๋ณธ๊ฒฉ์ ์ธ ํฌ๋กค๋ง ํ”„๋ ˆ์ž„์›Œํฌ

  • ๐Ÿ•ท๏ธ Scrapy ์Šคํƒ€์ผ Spider API: start_urls, ๋น„๋™๊ธฐ parse ์ฝœ๋ฐฑ, Request/Response ๊ฐ์ฒด๋กœ Spider๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
  • โšก ๋™์‹œ ํฌ๋กค๋ง: ์„ค์ • ๊ฐ€๋Šฅํ•œ ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œํ•œ, ๋„๋ฉ”์ธ๋ณ„ ์Šค๋กœํ‹€๋ง, ๋‹ค์šด๋กœ๋“œ ๋”œ๋ ˆ์ด๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ”„ ๋ฉ€ํ‹ฐ ์„ธ์…˜ ์ง€์›: HTTP ์š”์ฒญ๊ณผ ์Šคํ…”์Šค ํ—ค๋“œ๋ฆฌ์Šค ๋ธŒ๋ผ์šฐ์ €๋ฅผ ํ•˜๋‚˜์˜ ์ธํ„ฐํŽ˜์ด์Šค๋กœ ํ†ตํ•ฉ โ€” ID๋กœ ์š”์ฒญ์„ ๋‹ค๋ฅธ ์„ธ์…˜์— ๋ผ์šฐํŒ…ํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ’พ ์ผ์‹œ์ •์ง€ & ์žฌ๊ฐœ: ์ฒดํฌํฌ์ธํŠธ ๊ธฐ๋ฐ˜์˜ ํฌ๋กค๋ง ์˜์†ํ™”. Ctrl+C๋กœ ์ •์ƒ ์ข…๋ฃŒํ•˜๊ณ , ์žฌ์‹œ์ž‘ํ•˜๋ฉด ์ค‘๋‹จ๋œ ์ง€์ ๋ถ€ํ„ฐ ์ด์–ด๊ฐ‘๋‹ˆ๋‹ค.
  • ๐Ÿ“ก ์ŠคํŠธ๋ฆฌ๋ฐ ๋ชจ๋“œ: async for item in spider.stream()์œผ๋กœ ์Šคํฌ๋ ˆ์ดํ•‘๋œ ์•„์ดํ…œ์„ ์‹ค์‹œ๊ฐ„ ํ†ต๊ณ„์™€ ํ•จ๊ป˜ ์ŠคํŠธ๋ฆฌ๋ฐ์œผ๋กœ ์ˆ˜์‹  โ€” UI, ํŒŒ์ดํ”„๋ผ์ธ, ์žฅ์‹œ๊ฐ„ ํฌ๋กค๋ง์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ›ก๏ธ ์ฐจ๋‹จ๋œ ์š”์ฒญ ๊ฐ์ง€: ์ปค์Šคํ…€ ๋กœ์ง์„ ํ†ตํ•œ ์ฐจ๋‹จ๋œ ์š”์ฒญ์˜ ์ž๋™ ๊ฐ์ง€ ๋ฐ ์žฌ์‹œ๋„๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ“ฆ ๋‚ด์žฅ ๋‚ด๋ณด๋‚ด๊ธฐ: ํ›…์ด๋‚˜ ์ž์ฒด ํŒŒ์ดํ”„๋ผ์ธ, ๋˜๋Š” ๋‚ด์žฅ JSON/JSONL๋กœ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋ณด๋ƒ…๋‹ˆ๋‹ค. ๊ฐ๊ฐ result.items.to_json() / result.items.to_jsonl()์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์„ธ์…˜์„ ์ง€์›ํ•˜๋Š” ๊ณ ๊ธ‰ ์›น์‚ฌ์ดํŠธ ๊ฐ€์ ธ์˜ค๊ธฐ

  • HTTP ์š”์ฒญ: Fetcher ํด๋ž˜์Šค๋กœ ๋น ๋ฅด๊ณ  ์€๋ฐ€ํ•œ HTTP ์š”์ฒญ. ๋ธŒ๋ผ์šฐ์ €์˜ TLS fingerprint, ํ—ค๋”๋ฅผ ๋ชจ๋ฐฉํ•˜๊ณ , HTTP/3๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋™์  ๋กœ๋”ฉ: Playwright์˜ Chromium๊ณผ Google Chrome์„ ์ง€์›ํ•˜๋Š” DynamicFetcher ํด๋ž˜์Šค๋กœ ์™„์ „ํ•œ ๋ธŒ๋ผ์šฐ์ € ์ž๋™ํ™”๋ฅผ ํ†ตํ•ด ๋™์  ์›น์‚ฌ์ดํŠธ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.
  • ์•ˆํ‹ฐ๋ด‡ ์šฐํšŒ: StealthyFetcher์™€ fingerprint ์œ„์žฅ์„ ํ†ตํ•œ ๊ณ ๊ธ‰ ์Šคํ…”์Šค ๊ธฐ๋Šฅ. ์ž๋™ํ™”๋กœ ๋ชจ๋“  ์œ ํ˜•์˜ Cloudflare Turnstile/Interstitial์„ ์†์‰ฝ๊ฒŒ ์šฐํšŒํ•ฉ๋‹ˆ๋‹ค.
  • ์„ธ์…˜ ๊ด€๋ฆฌ: FetcherSession, StealthySession, DynamicSession ํด๋ž˜์Šค๋กœ ์š”์ฒญ ๊ฐ„ ์ฟ ํ‚ค์™€ ์ƒํƒœ๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ์˜์†์  ์„ธ์…˜์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
  • ํ”„๋ก์‹œ ๋กœํ…Œ์ด์…˜: ๋ชจ๋“  ์„ธ์…˜ ํƒ€์ž…์— ๋Œ€์‘ํ•˜๋Š” ์ˆœํ™˜ ๋˜๋Š” ์ปค์Šคํ…€ ์ „๋žต์˜ ๋‚ด์žฅ ProxyRotator์™€ ์š”์ฒญ๋ณ„ ํ”„๋ก์‹œ ์˜ค๋ฒ„๋ผ์ด๋“œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • ๋„๋ฉ”์ธ ์ฐจ๋‹จ: ๋ธŒ๋ผ์šฐ์ € ๊ธฐ๋ฐ˜ Fetcher์—์„œ ํŠน์ • ๋„๋ฉ”์ธ(๋ฐ ํ•˜์œ„ ๋„๋ฉ”์ธ)์œผ๋กœ์˜ ์š”์ฒญ์„ ์ฐจ๋‹จํ•ฉ๋‹ˆ๋‹ค.
  • ๋น„๋™๊ธฐ ์ง€์›: ๋ชจ๋“  Fetcher์™€ ์ „์šฉ ๋น„๋™๊ธฐ ์„ธ์…˜ ํด๋ž˜์Šค์—์„œ ์™„์ „ํ•œ ๋น„๋™๊ธฐ๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.

์ ์‘ํ˜• ์Šคํฌ๋ ˆ์ดํ•‘ & AI ํ†ตํ•ฉ

  • ๐Ÿ”„ ์Šค๋งˆํŠธ ์š”์†Œ ์ถ”์ : ์ง€๋Šฅ์ ์ธ ์œ ์‚ฌ๋„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์›น์‚ฌ์ดํŠธ ๋ณ€๊ฒฝ ํ›„์—๋„ ์š”์†Œ๋ฅผ ์žฌ๋ฐฐ์น˜ํ•ฉ๋‹ˆ๋‹ค.
  • ๐ŸŽฏ ์œ ์—ฐํ•œ ์Šค๋งˆํŠธ ์„ ํƒ: CSS selector, XPath selector, ํ•„ํ„ฐ ๊ธฐ๋ฐ˜ ๊ฒ€์ƒ‰, ํ…์ŠคํŠธ ๊ฒ€์ƒ‰, ์ •๊ทœ์‹ ๊ฒ€์ƒ‰ ๋“ฑ์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ” ์œ ์‚ฌ ์š”์†Œ ์ฐพ๊ธฐ: ๋ฐœ๊ฒฌ๋œ ์š”์†Œ์™€ ์œ ์‚ฌํ•œ ์š”์†Œ๋ฅผ ์ž๋™์œผ๋กœ ์ฐพ์•„๋ƒ…๋‹ˆ๋‹ค.
  • ๐Ÿค– AI์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” MCP ์„œ๋ฒ„: AI ๊ธฐ๋ฐ˜ Web Scraping๊ณผ ๋ฐ์ดํ„ฐ ์ถ”์ถœ์„ ์œ„ํ•œ ๋‚ด์žฅ MCP ์„œ๋ฒ„. AI(Claude/Cursor ๋“ฑ)์— ์ „๋‹ฌํ•˜๊ธฐ ์ „์— Scrapling์„ ํ™œ์šฉํ•ด ๋Œ€์ƒ ์ฝ˜ํ…์ธ ๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ฐ•๋ ฅํ•œ ์ปค์Šคํ…€ ๊ธฐ๋Šฅ์„ ๊ฐ–์ถ”๊ณ  ์žˆ์–ด, ์ž‘์—… ์†๋„๋ฅผ ๋†’์ด๊ณ  ํ† ํฐ ์‚ฌ์šฉ๋Ÿ‰์„ ์ตœ์†Œํ™”ํ•ด ๋น„์šฉ์„ ์ ˆ๊ฐํ•ฉ๋‹ˆ๋‹ค. (๋ฐ๋ชจ ์˜์ƒ)

๊ณ ์„ฑ๋Šฅ & ์‹ค์ „ ๊ฒ€์ฆ๋œ ์•„ํ‚คํ…์ฒ˜

  • ๐Ÿš€ ์ดˆ๊ณ ์†: ๋Œ€๋ถ€๋ถ„์˜ Python ์Šคํฌ๋ ˆ์ดํ•‘ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋Šฅ๊ฐ€ํ•˜๋Š” ์ตœ์ ํ™”๋œ ์„ฑ๋Šฅ.
  • ๐Ÿ”‹ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ: ์ตœ์ ํ™”๋œ ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ์™€ ์ง€์—ฐ ๋กœ๋”ฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.
  • โšก ๊ณ ์† JSON ์ง๋ ฌํ™”: ํ‘œ์ค€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ณด๋‹ค 10๋ฐฐ ๋น ๋ฆ…๋‹ˆ๋‹ค.
  • ๐Ÿ—๏ธ ์‹ค์ „ ๊ฒ€์ฆ: Scrapling์€ 92%์˜ ํ…Œ์ŠคํŠธ ์ปค๋ฒ„๋ฆฌ์ง€์™€ ์™„์ „ํ•œ ํƒ€์ž… ํžŒํŠธ ์ปค๋ฒ„๋ฆฌ์ง€๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์„ ๋ฟ ์•„๋‹ˆ๋ผ, ์ง€๋‚œ 1๋…„๊ฐ„ ์ˆ˜๋ฐฑ ๋ช…์˜ Web Scraper๊ฐ€ ๋งค์ผ ์‚ฌ์šฉํ•ด ์™”์Šต๋‹ˆ๋‹ค.

๊ฐœ๋ฐœ์ž/Web Scraper ์นœํ™”์  ๊ฒฝํ—˜

  • ๐ŸŽฏ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ Web Scraping Shell: Scrapling ํ†ตํ•ฉ, ๋‹จ์ถ•ํ‚ค, curl ์š”์ฒญ์„ Scrapling ์š”์ฒญ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๊ฑฐ๋‚˜ ๋ธŒ๋ผ์šฐ์ €์—์„œ ์š”์ฒญ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๋Š” ๋“ฑ์˜ ๋„๊ตฌ๋ฅผ ๊ฐ–์ถ˜ ์„ ํƒ์  ๋‚ด์žฅ IPython Shell๋กœ, Web Scraping ์Šคํฌ๋ฆฝํŠธ ๊ฐœ๋ฐœ์„ ๊ฐ€์†ํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿš€ ํ„ฐ๋ฏธ๋„์—์„œ ๋ฐ”๋กœ ์‚ฌ์šฉ: ์ฝ”๋“œ ํ•œ ์ค„ ์—†์ด Scrapling์œผ๋กœ URL์„ ์Šคํฌ๋ ˆ์ดํ•‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!
  • ๐Ÿ› ๏ธ ํ’๋ถ€ํ•œ ๋‚ด๋น„๊ฒŒ์ด์…˜ API: ๋ถ€๋ชจ, ํ˜•์ œ, ์ž์‹ ํƒ์ƒ‰ ๋ฉ”์„œ๋“œ๋ฅผ ํ†ตํ•œ ๊ณ ๊ธ‰ DOM ์ˆœํšŒ๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿงฌ ํ–ฅ์ƒ๋œ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ: ๋‚ด์žฅ ์ •๊ทœ์‹, ํด๋ฆฌ๋‹ ๋ฉ”์„œ๋“œ, ์ตœ์ ํ™”๋œ ๋ฌธ์ž์—ด ์—ฐ์‚ฐ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ“ ์ž๋™ ์…€๋ ‰ํ„ฐ ์ƒ์„ฑ: ๋ชจ๋“  ์š”์†Œ์— ๋Œ€ํ•ด ๊ฒฌ๊ณ ํ•œ CSS/XPath selector๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ”Œ ์ต์ˆ™ํ•œ API: Scrapy/Parsel์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผํ•œ ์˜์‚ฌ ์š”์†Œ(pseudo-element)๋ฅผ ๊ฐ€์ง„ Scrapy/BeautifulSoup ์Šคํƒ€์ผ์˜ API.
  • ๐Ÿ“˜ ์™„์ „ํ•œ ํƒ€์ž… ์ปค๋ฒ„๋ฆฌ์ง€: ๋›ฐ์–ด๋‚œ IDE ์ง€์›๊ณผ ์ฝ”๋“œ ์ž๋™์™„์„ฑ์„ ์œ„ํ•œ ์™„์ „ํ•œ ํƒ€์ž… ํžŒํŠธ. ์ฝ”๋“œ๋ฒ ์ด์Šค ์ „์ฒด๊ฐ€ ๋ณ€๊ฒฝ๋  ๋•Œ๋งˆ๋‹ค PyRight์™€ MyPy๋กœ ์ž๋™ ๊ฒ€์‚ฌ๋ฉ๋‹ˆ๋‹ค.
  • ๐Ÿ”‹ ๋ฐ”๋กœ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ Docker ์ด๋ฏธ์ง€: ๋งค ๋ฆด๋ฆฌ์Šค๋งˆ๋‹ค ๋ชจ๋“  ๋ธŒ๋ผ์šฐ์ €๋ฅผ ํฌํ•จํ•œ Docker ์ด๋ฏธ์ง€๊ฐ€ ์ž๋™์œผ๋กœ ๋นŒ๋“œ ๋ฐ ํ‘ธ์‹œ๋ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ

๊นŠ์ด ๋“ค์–ด๊ฐ€์ง€ ์•Š๊ณ , Scrapling์ด ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ๋“ค์„ ๊ฐ„๋‹จํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•

์„ธ์…˜์„ ์ง€์›ํ•˜๋Š” HTTP ์š”์ฒญ

from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use Chrome's latest TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# ๋˜๋Š” ์ผํšŒ์„ฑ ์š”์ฒญ ์‚ฌ์šฉ
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()

๊ณ ๊ธ‰ ์Šคํ…”์Šค ๋ชจ๋“œ

from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# ๋˜๋Š” ์ผํšŒ์„ฑ ์š”์ฒญ ์Šคํƒ€์ผ โ€” ์ด ์š”์ฒญ์„ ์œ„ํ•ด ๋ธŒ๋ผ์šฐ์ €๋ฅผ ์—ด๊ณ , ์™„๋ฃŒ ํ›„ ๋‹ซ์Šต๋‹ˆ๋‹ค
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()

์™„์ „ํ•œ ๋ธŒ๋ผ์šฐ์ € ์ž๋™ํ™”

from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # You can also use XPath selectors if you prefer

# ๋˜๋Š” ์ผํšŒ์„ฑ ์š”์ฒญ ์Šคํƒ€์ผ โ€” ์ด ์š”์ฒญ์„ ์œ„ํ•ด ๋ธŒ๋ผ์šฐ์ €๋ฅผ ์—ด๊ณ , ์™„๋ฃŒ ํ›„ ๋‹ซ์Šต๋‹ˆ๋‹ค
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()

Spider

๋™์‹œ ์š”์ฒญ, ์—ฌ๋Ÿฌ ์„ธ์…˜ ํƒ€์ž…, ์ผ์‹œ์ •์ง€ & ์žฌ๊ฐœ๋ฅผ ๊ฐ–์ถ˜ ๋ณธ๊ฒฉ์ ์ธ ํฌ๋กค๋Ÿฌ ๊ตฌ์ถ•:

from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }

        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")

ํ•˜๋‚˜์˜ Spider์—์„œ ์—ฌ๋Ÿฌ ์„ธ์…˜ ํƒ€์ž… ์‚ฌ์šฉ:

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # ๋ณดํ˜ธ๋œ ํŽ˜์ด์ง€๋Š” ์Šคํ…”์Šค ์„ธ์…˜์„ ํ†ตํ•ด ๋ผ์šฐํŒ…
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # ๋ช…์‹œ์  ์ฝœ๋ฐฑ

์ฒดํฌํฌ์ธํŠธ๋ฅผ ์‚ฌ์šฉํ•ด ์žฅ์‹œ๊ฐ„ ํฌ๋กค๋ง์„ ์ผ์‹œ์ •์ง€ & ์žฌ๊ฐœ:

QuotesSpider(crawldir="./crawl_data").start()

Ctrl+C๋ฅผ ๋ˆ„๋ฅด๋ฉด ์ •์ƒ์ ์œผ๋กœ ์ผ์‹œ์ •์ง€๋˜๊ณ , ์ง„ํ–‰ ์ƒํ™ฉ์ด ์ž๋™ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค. ์ดํ›„ Spider๋ฅผ ๋‹ค์‹œ ์‹œ์ž‘ํ•  ๋•Œ ๋™์ผํ•œ crawldir์„ ์ „๋‹ฌํ•˜๋ฉด ์ค‘๋‹จ๋œ ์ง€์ ๋ถ€ํ„ฐ ์žฌ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๊ธ‰ ํŒŒ์‹ฑ & ๋‚ด๋น„๊ฒŒ์ด์…˜

from scrapling.fetchers import Fetcher

# ํ’๋ถ€ํ•œ ์š”์†Œ ์„ ํƒ๊ณผ ๋‚ด๋น„๊ฒŒ์ด์…˜
page = Fetcher.get('https://quotes.toscrape.com/')

# ์—ฌ๋Ÿฌ ์„ ํƒ ๋ฉ”์„œ๋“œ๋กœ ์ธ์šฉ๊ตฌ ๊ฐ€์ ธ์˜ค๊ธฐ
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup ์Šคํƒ€์ผ
# ์•„๋ž˜์™€ ๋™์ผ
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # ๋“ฑ๋“ฑ...
# ํ…์ŠคํŠธ ๋‚ด์šฉ์œผ๋กœ ์š”์†Œ ์ฐพ๊ธฐ
quotes = page.find_by_text('quote', tag='div')

# ๊ณ ๊ธ‰ ๋‚ด๋น„๊ฒŒ์ด์…˜
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # ์ฒด์ด๋‹ ์…€๋ ‰ํ„ฐ
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# ์š”์†Œ ๊ด€๊ณ„์™€ ์œ ์‚ฌ๋„
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()

์›น์‚ฌ์ดํŠธ๋ฅผ ๊ฐ€์ ธ์˜ค์ง€ ์•Š๊ณ  ํŒŒ์„œ๋ฅผ ๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:

from scrapling.parser import Selector

page = Selector("<html>...</html>")

์‚ฌ์šฉ๋ฒ•์€ ์™„์ „ํžˆ ๋™์ผํ•ฉ๋‹ˆ๋‹ค!

๋น„๋™๊ธฐ ์„ธ์…˜ ๊ด€๋ฆฌ ์˜ˆ์‹œ

import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# ๋น„๋™๊ธฐ ์„ธ์…˜ ์‚ฌ์šฉ
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']

    for url in urls:
        task = session.fetch(url)
        tasks.append(task)

    print(session.get_pool_stats())  # Optional - status of the browser tab pool (busy/idle/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())

CLI & Interactive Shell

Scrapling์—๋Š” ๊ฐ•๋ ฅํ•œ ์ปค๋งจ๋“œ๋ผ์ธ ์ธํ„ฐํŽ˜์ด์Šค๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

[asciicast demo]

์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ Web Scraping Shell ์‹คํ–‰

scrapling shell

ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—†์ด ํŽ˜์ด์ง€๋ฅผ ํŒŒ์ผ๋กœ ๋ฐ”๋กœ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค (๊ธฐ๋ณธ์ ์œผ๋กœ body ํƒœ๊ทธ ๋‚ด๋ถ€์˜ ์ฝ˜ํ…์ธ ๋ฅผ ์ถ”์ถœ). ์ถœ๋ ฅ ํŒŒ์ผ์ด .txt๋กœ ๋๋‚˜๋ฉด ๋Œ€์ƒ์˜ ํ…์ŠคํŠธ ์ฝ˜ํ…์ธ ๊ฐ€ ์ถ”์ถœ๋ฉ๋‹ˆ๋‹ค. .md๋กœ ๋๋‚˜๋ฉด HTML ์ฝ˜ํ…์ธ ์˜ Markdown ํ‘œํ˜„์ด ๋ฉ๋‹ˆ๋‹ค. .html๋กœ ๋๋‚˜๋ฉด HTML ์ฝ˜ํ…์ธ  ์ž์ฒด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare

MCP ์„œ๋ฒ„์™€ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ Web Scraping Shell ๋“ฑ ๋” ๋งŽ์€ ๊ธฐ๋Šฅ์ด ์žˆ์ง€๋งŒ, ์ด ํŽ˜์ด์ง€๋Š” ๊ฐ„๊ฒฐํ•˜๊ฒŒ ์œ ์ง€ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์ „์ฒด ๋ฌธ์„œ๋Š” ์—ฌ๊ธฐ์—์„œ ํ™•์ธํ•˜์„ธ์š”.

์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํฌ

Scrapling์€ ๊ฐ•๋ ฅํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ดˆ๊ณ ์†์ž…๋‹ˆ๋‹ค. ์•„๋ž˜ ๋ฒค์น˜๋งˆํฌ๋Š” Scrapling์˜ ํŒŒ์„œ๋ฅผ ๋‹ค๋ฅธ ์ธ๊ธฐ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ์ตœ์‹  ๋ฒ„์ „๊ณผ ๋น„๊ตํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ…์ŠคํŠธ ์ถ”์ถœ ์†๋„ ํ…Œ์ŠคํŠธ (5000๊ฐœ ์ค‘์ฒฉ ์š”์†Œ)

| # | Library           | Time (ms) | vs Scrapling |
|---|-------------------|-----------|--------------|
| 1 | Scrapling         | 2.02      | 1.0x         |
| 2 | Parsel/Scrapy     | 2.04      | 1.01x        |
| 3 | Raw Lxml          | 2.54      | 1.257x       |
| 4 | PyQuery           | 24.17     | ~12x         |
| 5 | Selectolax        | 82.63     | ~41x         |
| 6 | MechanicalSoup    | 1549.71   | ~767.1x      |
| 7 | BS4 with Lxml     | 1584.31   | ~784.3x      |
| 8 | BS4 with html5lib | 3391.91   | ~1679.1x     |

์š”์†Œ ์œ ์‚ฌ๋„ & ํ…์ŠคํŠธ ๊ฒ€์ƒ‰ ์„ฑ๋Šฅ

Scrapling์˜ ์ ์‘ํ˜• ์š”์†Œ ์ฐพ๊ธฐ ๊ธฐ๋Šฅ์€ ๋Œ€์•ˆ๋“ค์„ ํฌ๊ฒŒ ์•ž์„ญ๋‹ˆ๋‹ค:

| Library     | Time (ms) | vs Scrapling |
|-------------|-----------|--------------|
| Scrapling   | 2.39      | 1.0x         |
| AutoScraper | 12.45     | 5.209x       |

๋ชจ๋“  ๋ฒค์น˜๋งˆํฌ๋Š” 100ํšŒ ์ด์ƒ ์‹คํ–‰์˜ ํ‰๊ท ์ž…๋‹ˆ๋‹ค. ์ธก์ • ๋ฐฉ๋ฒ•์€ benchmarks.py๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

์„ค์น˜

Scrapling์€ Python 3.10 ์ด์ƒ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

pip install scrapling

์ด ์„ค์น˜์—๋Š” ํŒŒ์„œ ์—”์ง„๊ณผ ์˜์กด์„ฑ๋งŒ ํฌํ•จ๋˜๋ฉฐ, Fetcher๋‚˜ ์ปค๋งจ๋“œ๋ผ์ธ ์˜์กด์„ฑ์€ ํฌํ•จ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

์„ ํƒ์  ์˜์กด์„ฑ

  1. ์•„๋ž˜์˜ ์ถ”๊ฐ€ ๊ธฐ๋Šฅ, Fetcher, ๋˜๋Š” ๊ด€๋ จ ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜๋ ค๋ฉด Fetcher ์˜์กด์„ฑ๊ณผ ๋ธŒ๋ผ์šฐ์ € ์˜์กด์„ฑ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

    pip install "scrapling[fetchers]"
    
    scrapling install           # Regular install
    scrapling install --force   # Force reinstall
    

    ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ชจ๋“  ๋ธŒ๋ผ์šฐ์ €์™€ ์‹œ์Šคํ…œ ์˜์กด์„ฑ, fingerprint ์กฐ์ž‘ ์˜์กด์„ฑ์ด ๋‹ค์šด๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.

    ๋˜๋Š” ๋ช…๋ น์–ด ๋Œ€์‹  ์ฝ”๋“œ์—์„œ ์„ค์น˜ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:

    from scrapling.cli import install
    
    install([], standalone_mode=False)          # Regular install
    install(["--force"], standalone_mode=False) # Force reinstall
    
  2. ์ถ”๊ฐ€ ๊ธฐ๋Šฅ:

    • Install the MCP server feature:
      pip install "scrapling[ai]"
      
    • Install the shell features (the Web Scraping shell and the extract command):
      pip install "scrapling[shell]"
      
    • ๋ชจ๋“  ๊ธฐ๋Šฅ ์„ค์น˜:
      pip install "scrapling[all]"
      
      ์œ„ ์ถ”๊ฐ€ ๊ธฐ๋Šฅ์„ ์„ค์น˜ํ•œ ํ›„์—๋„ (์•„์ง ํ•˜์ง€ ์•Š์•˜๋‹ค๋ฉด) scrapling install๋กœ ๋ธŒ๋ผ์šฐ์ € ์˜์กด์„ฑ์„ ์„ค์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Docker

DockerHub์—์„œ ๋ชจ๋“  ์ถ”๊ฐ€ ๊ธฐ๋Šฅ๊ณผ ๋ธŒ๋ผ์šฐ์ €๊ฐ€ ํฌํ•จ๋œ Docker ์ด๋ฏธ์ง€๋ฅผ ์„ค์น˜ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:

docker pull pyd4vinci/scrapling

๋˜๋Š” GitHub ๋ ˆ์ง€์ŠคํŠธ๋ฆฌ์—์„œ ๋‹ค์šด๋กœ๋“œ:

docker pull ghcr.io/d4vinci/scrapling:latest

์ด ์ด๋ฏธ์ง€๋Š” GitHub Actions์™€ ๋ ˆํฌ์ง€ํ† ๋ฆฌ์˜ main ๋ธŒ๋žœ์น˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ž๋™์œผ๋กœ ๋นŒ๋“œ ๋ฐ ํ‘ธ์‹œ๋ฉ๋‹ˆ๋‹ค.

๊ธฐ์—ฌํ•˜๊ธฐ

๊ธฐ์—ฌ๋ฅผ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๊ธฐ์—ฌ ๊ฐ€์ด๋“œ๋ผ์ธ์„ ์ฝ์–ด์ฃผ์„ธ์š”.

๋ฉด์ฑ… ์กฐํ•ญ

์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ๊ต์œก ๋ฐ ์—ฐ๊ตฌ ๋ชฉ์ ์œผ๋กœ๋งŒ ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค. ์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ, ๊ตญ๋‚ด์™ธ ๋ฐ์ดํ„ฐ ์Šคํฌ๋ ˆ์ดํ•‘ ๋ฐ ๊ฐœ์ธ์ •๋ณด ๋ณดํ˜ธ ๊ด€๋ จ ๋ฒ•๋ฅ ์„ ์ค€์ˆ˜ํ•˜๋Š” ๋ฐ ๋™์˜ํ•œ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผ๋ฉ๋‹ˆ๋‹ค. ์ €์ž์™€ ๊ธฐ์—ฌ์ž๋Š” ์ด ์†Œํ”„ํŠธ์›จ์–ด์˜ ์˜ค์šฉ์— ๋Œ€ํ•ด ์ฑ…์ž„์ง€์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•ญ์ƒ ์›น์‚ฌ์ดํŠธ์˜ ์ด์šฉ์•ฝ๊ด€๊ณผ robots.txt ํŒŒ์ผ์„ ์กด์ค‘ํ•˜์„ธ์š”.

๐ŸŽ“ ์ธ์šฉ

์—ฐ๊ตฌ ๋ชฉ์ ์œผ๋กœ ์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์…จ๋‹ค๋ฉด, ์•„๋ž˜ ์ฐธ๊ณ  ๋ฌธํ—Œ์œผ๋กœ ์ธ์šฉํ•ด ์ฃผ์„ธ์š”:

  @misc{scrapling,
    author = {Karim Shoair},
    title = {Scrapling},
    year = {2024},
    url = {https://github.com/D4Vinci/Scrapling},
    note = {An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!}
  }

๋ผ์ด์„ ์Šค

์ด ํ”„๋กœ์ ํŠธ๋Š” BSD-3-Clause ๋ผ์ด์„ ์Šค ํ•˜์— ๋ฐฐํฌ๋ฉ๋‹ˆ๋‹ค.

๊ฐ์‚ฌ์˜ ๋ง

์ด ํ”„๋กœ์ ํŠธ์—๋Š” ๋‹ค์Œ์—์„œ ์ฐจ์šฉํ•œ ์ฝ”๋“œ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • Parsel (BSD ๋ผ์ด์„ ์Šค) โ€” translator ์„œ๋ธŒ๋ชจ๋“ˆ์— ์‚ฌ์šฉ

Karim Shoair๊ฐ€ โค๏ธ์œผ๋กœ ๋””์ž์ธํ•˜๊ณ  ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.