<!-- mcp-name: io.github.D4Vinci/Scrapling -->
<h1 align="center">
<a href="https://scrapling.readthedocs.io">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
<img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
</picture>
</a>
<br>
<small>Effortless Web Scraping for the Modern Web</small>
</h1>
<p align="center">
<a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
<img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
<a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
<img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
<a href="https://clickpy.clickhouse.com/dashboard/scrapling" rel="nofollow"><img src="https://img.shields.io/pypi/dm/scrapling" alt="PyPI package downloads"></a>
<a href="https://github.com/D4Vinci/Scrapling/tree/main/agent-skill" alt="AI Agent Skill directory">
<img alt="Static Badge" src="https://img.shields.io/badge/Skill-black?style=flat&label=Agent&link=https%3A%2F%2Fgithub.com%2FD4Vinci%2FScrapling%2Ftree%2Fmain%2Fagent-skill"></a>
<a href="https://clawhub.ai/D4Vinci/scrapling-official" alt="OpenClaw Skill">
<img alt="OpenClaw Skill" src="https://img.shields.io/badge/Clawhub-darkred?style=flat&label=OpenClaw&link=https%3A%2F%2Fclawhub.ai%2FD4Vinci%2Fscrapling-official"></a>
<br/>
<a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
<img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
</a>
<a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
<img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
</a>
<br/>
<a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
<img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
</p>
<p align="center">
<a href="https://scrapling.readthedocs.io/en/latest/parsing/selection.html"><strong>Selection Methods</strong></a>
&middot;
<a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing.html"><strong>Choosing a Fetcher</strong></a>
&middot;
<a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a>
&middot;
<a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy Rotation</strong></a>
&middot;
<a href="https://scrapling.readthedocs.io/en/latest/cli/overview.html"><strong>CLI</strong></a>
&middot;
<a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html"><strong>MCP Server</strong></a>
</p>
Scrapling์€ ๋‹จ์ผ ์š”์ฒญ๋ถ€ํ„ฐ ๋Œ€๊ทœ๋ชจ ํฌ๋กค๋ง๊นŒ์ง€ ๋ชจ๋“  ๊ฒƒ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ์ ์‘ํ˜• Web Scraping ํ”„๋ ˆ์ž„์›Œํฌ์ž…๋‹ˆ๋‹ค.
ํŒŒ์„œ๋Š” ์›น์‚ฌ์ดํŠธ ๋ณ€๊ฒฝ ์‚ฌํ•ญ์„ ํ•™์Šตํ•˜๊ณ , ํŽ˜์ด์ง€๊ฐ€ ์—…๋ฐ์ดํŠธ๋˜๋ฉด ์š”์†Œ๋ฅผ ์ž๋™์œผ๋กœ ์žฌ๋ฐฐ์น˜ํ•ฉ๋‹ˆ๋‹ค. Fetcher๋Š” Cloudflare Turnstile ๊ฐ™์€ ์•ˆํ‹ฐ๋ด‡ ์‹œ์Šคํ…œ์„ ๋ณ„๋„ ์„ค์ • ์—†์ด ์šฐํšŒํ•ฉ๋‹ˆ๋‹ค. Spider ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ผ์‹œ์ •์ง€/์žฌ๊ฐœ ๋ฐ ์ž๋™ ํ”„๋ก์‹œ ๋กœํ…Œ์ด์…˜์„ ๊ฐ–์ถ˜ ๋™์‹œ ๋ฉ€ํ‹ฐ ์„ธ์…˜ ํฌ๋กค๋ง์œผ๋กœ ํ™•์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค โ€” ๋ชจ๋‘ Python ๋ช‡ ์ค„์ด๋ฉด ๋ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜์˜ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ, ํƒ€ํ˜‘ ์—†๋Š” ์„ฑ๋Šฅ.
์‹ค์‹œ๊ฐ„ ํ†ต๊ณ„์™€ ์ŠคํŠธ๋ฆฌ๋ฐ์„ ํ†ตํ•œ ์ดˆ๊ณ ์† ํฌ๋กค๋ง. Web Scraper๊ฐ€ ๋งŒ๋“ค๊ณ , Web Scraper์™€ ์ผ๋ฐ˜ ์‚ฌ์šฉ์ž ๋ชจ๋‘๋ฅผ ์œ„ํ•ด ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.
```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website while avoiding detection!
products = p.css('.product', auto_save=True)  # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)  # If the website structure changes later, pass `adaptive=True` to find the elements again!
```
๋˜๋Š” ๋ณธ๊ฒฉ์ ์ธ ํฌ๋กค๋ง์œผ๋กœ ํ™•์žฅ
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
```
<p align="center">
<a href="https://dataimpulse.com/?utm_source=scrapling&utm_medium=banner&utm_campaign=scrapling" target="_blank" style="display:flex; justify-content:center; padding:4px 0;">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/DataImpulse.png" alt="At DataImpulse, we specialize in developing custom proxy services for your business. Make requests from anywhere, collect data, and enjoy fast connections with our premium proxies." style="max-height:60px;">
</a>
</p>
# ํ”Œ๋ž˜ํ‹ฐ๋„˜ ์Šคํฐ์„œ
<table>
<tr>
<td width="200">
<a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png">
</a>
</td>
<td> Scrapling์€ Cloudflare Turnstile์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์—”ํ„ฐํ”„๋ผ์ด์ฆˆ๊ธ‰ ๋ณดํ˜ธ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋ฉด, <a href="https://hypersolutions.co?utm_source=github&utm_medium=readme&utm_campaign=scrapling">
<b>Hyper Solutions</b>
</a>๊ฐ€ <b>Akamai</b>, <b>DataDome</b>, <b>Kasada</b>, <b>Incapsula</b>์šฉ ์œ ํšจํ•œ ์•ˆํ‹ฐ๋ด‡ ํ† ํฐ์„ ์ƒ์„ฑํ•˜๋Š” API ์—”๋“œํฌ์ธํŠธ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฐ„๋‹จํ•œ API ํ˜ธ์ถœ๋งŒ์œผ๋กœ, ๋ธŒ๋ผ์šฐ์ € ์ž๋™ํ™”๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค. </td>
</tr>
<tr>
<td width="200">
<a href="https://birdproxies.com/t/scrapling" target="_blank" title="At Bird Proxies, we eliminate your pains such as banned IPs, geo restriction, and high costs so you can focus on your work.">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/BirdProxies.jpg">
</a>
</td>
<td>ํ”„๋ก์‹œ๋Š” ๋ณต์žกํ•˜๊ฑฐ๋‚˜ ๋น„์Œ€ ์ด์œ ๊ฐ€ ์—†๋‹ค๋Š” ์ƒ๊ฐ์œผ๋กœ <a href="https://birdproxies.com/t/scrapling">
<b>BirdProxies</b>
</a>๋ฅผ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค. <br /> 195๊ฐœ ์ด์ƒ ์ง€์—ญ์˜ ๋น ๋ฅธ ๋ ˆ์ง€๋ด์…œ ๋ฐ ISP ํ”„๋ก์‹œ, ํ•ฉ๋ฆฌ์ ์ธ ๊ฐ€๊ฒฉ, ์‹ค์งˆ์ ์ธ ์ง€์›. <br />
<b>๋žœ๋”ฉ ํŽ˜์ด์ง€์—์„œ FlappyBird ๊ฒŒ์ž„์„ ํ”Œ๋ ˆ์ดํ•˜๊ณ  ๋ฌด๋ฃŒ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์œผ์„ธ์š”!</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png">
</a>
</td>
<td>
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling">
<b>Evomi</b>
</a>: ๋ ˆ์ง€๋ด์…œ ํ”„๋ก์‹œ GB๋‹น $0.49๋ถ€ํ„ฐ. ์™„์ „ํžˆ ์œ„์žฅ๋œ Chromium ์Šคํฌ๋ ˆ์ดํ•‘ ๋ธŒ๋ผ์šฐ์ €, ๋ ˆ์ง€๋ด์…œ IP, ์ž๋™ CAPTCHA ํ•ด๊ฒฐ, ์•ˆํ‹ฐ๋ด‡ ์šฐํšŒ.</br>
<b>Scraper API๋กœ ๋ฒˆ๊ฑฐ๋กœ์›€ ์—†์ด ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ์„ธ์š”. MCP ๋ฐ N8N ํ†ตํ•ฉ ์ง€์›.</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank" title="Unlock the Power of Social Media Data & AI">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TikHub.jpg">
</a>
</td>
<td>
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank">TikHub.io</a> offers 900+ stable APIs across 16+ platforms, including TikTok, X, YouTube, and Instagram, plus more than 40 million datasets. <br /> <a href="https://ai.tikhub.io/?ref=KarimShoair" target="_blank">Discounted AI models</a> are also available: up to 71% off Claude, GPT, Gemini, and more.
</td>
</tr>
<tr>
<td width="200">
<a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank" title="Scalable Web Data Access for AI Applications">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/nsocks.png">
</a>
</td>
<td>
<a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank">Nsocks</a> provides fast residential and ISP proxies for developers and scrapers: global IP coverage, high anonymity, smart rotation, and stable performance for automation and data extraction. Simplify large-scale web crawling with <a href="https://www.xcrawl.com/?keyword=2p67aivg" target="_blank">Xcrawl</a>.
</td>
</tr>
<tr>
<td width="200">
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting.">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png">
</a>
</td>
<td>
๋…ธํŠธ๋ถ์„ ๋‹ซ์œผ์„ธ์š”. ์Šคํฌ๋ž˜ํผ๋Š” ๊ณ„์† ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. <br />
<a href="https://petrosky.io/d4vinci" target="_blank">PetroSky VPS</a> - ๋…ผ์Šคํ†ฑ ์ž๋™ํ™”๋ฅผ ์œ„ํ•œ ํด๋ผ์šฐ๋“œ ์„œ๋ฒ„. Windows ๋ฐ Linux ๋จธ์‹ ์„ ์™„๋ฒฝํ•˜๊ฒŒ ์ œ์–ด. ์›” โ‚ฌ6.99๋ถ€ํ„ฐ.
</td>
</tr>
<tr>
<td width="200">
<a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank" title="The #1 newsletter dedicated to Web Scraping">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TWSC.png">
</a>
</td>
<td>
Read the <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank">full review of Scrapling on The Web Scraping Club</a> (November 2025), the #1 newsletter dedicated to Web Scraping.
</td>
</tr>
<tr>
<td width="200">
<a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank" title="Proxy-Seller provides reliable proxy infrastructure for Web Scraping">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxySeller.png">
</a>
</td>
<td>
<a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank">Proxy-Seller</a> provides reliable proxy infrastructure for Web Scraping: IPv4, IPv6, ISP, residential, and mobile proxies with stable performance, wide geographic coverage, and flexible plans for enterprise-scale data collection.
</td>
</tr>
</table>
<i><sub>์—ฌ๊ธฐ์— ๊ด‘๊ณ ๋ฅผ ๊ฒŒ์žฌํ•˜๊ณ  ์‹ถ์œผ์‹ ๊ฐ€์š”? [์—ฌ๊ธฐ](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646)๋ฅผ ํด๋ฆญํ•˜์„ธ์š”</sub></i>
# ์Šคํฐ์„œ
<!-- sponsors -->
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
<a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
<a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
<a href="https://www.webshare.io/?referral_code=48r2m2cd5uz1" target="_blank" title="The Most Reliable Proxy with Unparalleled Performance"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/webshare.png"></a>
<a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
<!-- /sponsors -->
<i><sub>์—ฌ๊ธฐ์— ๊ด‘๊ณ ๋ฅผ ๊ฒŒ์žฌํ•˜๊ณ  ์‹ถ์œผ์‹ ๊ฐ€์š”? [์—ฌ๊ธฐ](https://github.com/sponsors/D4Vinci)๋ฅผ ํด๋ฆญํ•˜๊ณ  ์›ํ•˜๋Š” ํ‹ฐ์–ด๋ฅผ ์„ ํƒํ•˜์„ธ์š”!</sub></i>
---
## ์ฃผ์š” ๊ธฐ๋Šฅ
### Spider โ€” ๋ณธ๊ฒฉ์ ์ธ ํฌ๋กค๋ง ํ”„๋ ˆ์ž„์›Œํฌ
- ๐Ÿ•ท๏ธ **Scrapy ์Šคํƒ€์ผ Spider API**: `start_urls`, ๋น„๋™๊ธฐ `parse` ์ฝœ๋ฐฑ, `Request`/`Response` ๊ฐ์ฒด๋กœ Spider๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
- โšก **๋™์‹œ ํฌ๋กค๋ง**: ์„ค์ • ๊ฐ€๋Šฅํ•œ ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œํ•œ, ๋„๋ฉ”์ธ๋ณ„ ์Šค๋กœํ‹€๋ง, ๋‹ค์šด๋กœ๋“œ ๋”œ๋ ˆ์ด๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿ”„ **๋ฉ€ํ‹ฐ ์„ธ์…˜ ์ง€์›**: HTTP ์š”์ฒญ๊ณผ ์Šคํ…”์Šค ํ—ค๋“œ๋ฆฌ์Šค ๋ธŒ๋ผ์šฐ์ €๋ฅผ ํ•˜๋‚˜์˜ ์ธํ„ฐํŽ˜์ด์Šค๋กœ ํ†ตํ•ฉ โ€” ID๋กœ ์š”์ฒญ์„ ๋‹ค๋ฅธ ์„ธ์…˜์— ๋ผ์šฐํŒ…ํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿ’พ **์ผ์‹œ์ •์ง€ & ์žฌ๊ฐœ**: ์ฒดํฌํฌ์ธํŠธ ๊ธฐ๋ฐ˜์˜ ํฌ๋กค๋ง ์˜์†ํ™”. Ctrl+C๋กœ ์ •์ƒ ์ข…๋ฃŒํ•˜๊ณ , ์žฌ์‹œ์ž‘ํ•˜๋ฉด ์ค‘๋‹จ๋œ ์ง€์ ๋ถ€ํ„ฐ ์ด์–ด๊ฐ‘๋‹ˆ๋‹ค.
- ๐Ÿ“ก **์ŠคํŠธ๋ฆฌ๋ฐ ๋ชจ๋“œ**: `async for item in spider.stream()`์œผ๋กœ ์Šคํฌ๋ ˆ์ดํ•‘๋œ ์•„์ดํ…œ์„ ์‹ค์‹œ๊ฐ„ ํ†ต๊ณ„์™€ ํ•จ๊ป˜ ์ŠคํŠธ๋ฆฌ๋ฐ์œผ๋กœ ์ˆ˜์‹  โ€” UI, ํŒŒ์ดํ”„๋ผ์ธ, ์žฅ์‹œ๊ฐ„ ํฌ๋กค๋ง์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿ›ก๏ธ **์ฐจ๋‹จ๋œ ์š”์ฒญ ๊ฐ์ง€**: ์ปค์Šคํ…€ ๋กœ์ง์„ ํ†ตํ•œ ์ฐจ๋‹จ๋œ ์š”์ฒญ์˜ ์ž๋™ ๊ฐ์ง€ ๋ฐ ์žฌ์‹œ๋„๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿ“ฆ **๋‚ด์žฅ ๋‚ด๋ณด๋‚ด๊ธฐ**: ํ›…์ด๋‚˜ ์ž์ฒด ํŒŒ์ดํ”„๋ผ์ธ, ๋˜๋Š” ๋‚ด์žฅ JSON/JSONL๋กœ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋ณด๋ƒ…๋‹ˆ๋‹ค. ๊ฐ๊ฐ `result.items.to_json()` / `result.items.to_jsonl()`์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
### ์„ธ์…˜์„ ์ง€์›ํ•˜๋Š” ๊ณ ๊ธ‰ ์›น์‚ฌ์ดํŠธ ๊ฐ€์ ธ์˜ค๊ธฐ
- **HTTP ์š”์ฒญ**: `Fetcher` ํด๋ž˜์Šค๋กœ ๋น ๋ฅด๊ณ  ์€๋ฐ€ํ•œ HTTP ์š”์ฒญ. ๋ธŒ๋ผ์šฐ์ €์˜ TLS fingerprint, ํ—ค๋”๋ฅผ ๋ชจ๋ฐฉํ•˜๊ณ , HTTP/3๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
- **๋™์  ๋กœ๋”ฉ**: Playwright์˜ Chromium๊ณผ Google Chrome์„ ์ง€์›ํ•˜๋Š” `DynamicFetcher` ํด๋ž˜์Šค๋กœ ์™„์ „ํ•œ ๋ธŒ๋ผ์šฐ์ € ์ž๋™ํ™”๋ฅผ ํ†ตํ•ด ๋™์  ์›น์‚ฌ์ดํŠธ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.
- **์•ˆํ‹ฐ๋ด‡ ์šฐํšŒ**: `StealthyFetcher`์™€ fingerprint ์œ„์žฅ์„ ํ†ตํ•œ ๊ณ ๊ธ‰ ์Šคํ…”์Šค ๊ธฐ๋Šฅ. ์ž๋™ํ™”๋กœ ๋ชจ๋“  ์œ ํ˜•์˜ Cloudflare Turnstile/Interstitial์„ ์†์‰ฝ๊ฒŒ ์šฐํšŒํ•ฉ๋‹ˆ๋‹ค.
- **์„ธ์…˜ ๊ด€๋ฆฌ**: `FetcherSession`, `StealthySession`, `DynamicSession` ํด๋ž˜์Šค๋กœ ์š”์ฒญ ๊ฐ„ ์ฟ ํ‚ค์™€ ์ƒํƒœ๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ์˜์†์  ์„ธ์…˜์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
- **ํ”„๋ก์‹œ ๋กœํ…Œ์ด์…˜**: ๋ชจ๋“  ์„ธ์…˜ ํƒ€์ž…์— ๋Œ€์‘ํ•˜๋Š” ์ˆœํ™˜ ๋˜๋Š” ์ปค์Šคํ…€ ์ „๋žต์˜ ๋‚ด์žฅ `ProxyRotator`์™€ ์š”์ฒญ๋ณ„ ํ”„๋ก์‹œ ์˜ค๋ฒ„๋ผ์ด๋“œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
- **๋„๋ฉ”์ธ ์ฐจ๋‹จ**: ๋ธŒ๋ผ์šฐ์ € ๊ธฐ๋ฐ˜ Fetcher์—์„œ ํŠน์ • ๋„๋ฉ”์ธ(๋ฐ ํ•˜์œ„ ๋„๋ฉ”์ธ)์œผ๋กœ์˜ ์š”์ฒญ์„ ์ฐจ๋‹จํ•ฉ๋‹ˆ๋‹ค.
- **๋น„๋™๊ธฐ ์ง€์›**: ๋ชจ๋“  Fetcher์™€ ์ „์šฉ ๋น„๋™๊ธฐ ์„ธ์…˜ ํด๋ž˜์Šค์—์„œ ์™„์ „ํ•œ ๋น„๋™๊ธฐ๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
### ์ ์‘ํ˜• ์Šคํฌ๋ ˆ์ดํ•‘ & AI ํ†ตํ•ฉ
- ๐Ÿ”„ **์Šค๋งˆํŠธ ์š”์†Œ ์ถ”์ **: ์ง€๋Šฅ์ ์ธ ์œ ์‚ฌ๋„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์›น์‚ฌ์ดํŠธ ๋ณ€๊ฒฝ ํ›„์—๋„ ์š”์†Œ๋ฅผ ์žฌ๋ฐฐ์น˜ํ•ฉ๋‹ˆ๋‹ค.
- ๐ŸŽฏ **์œ ์—ฐํ•œ ์Šค๋งˆํŠธ ์„ ํƒ**: CSS selector, XPath selector, ํ•„ํ„ฐ ๊ธฐ๋ฐ˜ ๊ฒ€์ƒ‰, ํ…์ŠคํŠธ ๊ฒ€์ƒ‰, ์ •๊ทœ์‹ ๊ฒ€์ƒ‰ ๋“ฑ์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿ” **์œ ์‚ฌ ์š”์†Œ ์ฐพ๊ธฐ**: ๋ฐœ๊ฒฌ๋œ ์š”์†Œ์™€ ์œ ์‚ฌํ•œ ์š”์†Œ๋ฅผ ์ž๋™์œผ๋กœ ์ฐพ์•„๋ƒ…๋‹ˆ๋‹ค.
- ๐Ÿค– **AI์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” MCP ์„œ๋ฒ„**: AI ๊ธฐ๋ฐ˜ Web Scraping๊ณผ ๋ฐ์ดํ„ฐ ์ถ”์ถœ์„ ์œ„ํ•œ ๋‚ด์žฅ MCP ์„œ๋ฒ„. AI(Claude/Cursor ๋“ฑ)์— ์ „๋‹ฌํ•˜๊ธฐ ์ „์— Scrapling์„ ํ™œ์šฉํ•ด ๋Œ€์ƒ ์ฝ˜ํ…์ธ ๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ฐ•๋ ฅํ•œ ์ปค์Šคํ…€ ๊ธฐ๋Šฅ์„ ๊ฐ–์ถ”๊ณ  ์žˆ์–ด, ์ž‘์—… ์†๋„๋ฅผ ๋†’์ด๊ณ  ํ† ํฐ ์‚ฌ์šฉ๋Ÿ‰์„ ์ตœ์†Œํ™”ํ•ด ๋น„์šฉ์„ ์ ˆ๊ฐํ•ฉ๋‹ˆ๋‹ค. ([๋ฐ๋ชจ ์˜์ƒ](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
### ๊ณ ์„ฑ๋Šฅ & ์‹ค์ „ ๊ฒ€์ฆ๋œ ์•„ํ‚คํ…์ฒ˜
- ๐Ÿš€ **์ดˆ๊ณ ์†**: ๋Œ€๋ถ€๋ถ„์˜ Python ์Šคํฌ๋ ˆ์ดํ•‘ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋Šฅ๊ฐ€ํ•˜๋Š” ์ตœ์ ํ™”๋œ ์„ฑ๋Šฅ.
- ๐Ÿ”‹ **๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ**: ์ตœ์ ํ™”๋œ ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ์™€ ์ง€์—ฐ ๋กœ๋”ฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.
- โšก **๊ณ ์† JSON ์ง๋ ฌํ™”**: ํ‘œ์ค€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ณด๋‹ค 10๋ฐฐ ๋น ๋ฆ…๋‹ˆ๋‹ค.
- ๐Ÿ—๏ธ **์‹ค์ „ ๊ฒ€์ฆ**: Scrapling์€ 92%์˜ ํ…Œ์ŠคํŠธ ์ปค๋ฒ„๋ฆฌ์ง€์™€ ์™„์ „ํ•œ ํƒ€์ž… ํžŒํŠธ ์ปค๋ฒ„๋ฆฌ์ง€๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์„ ๋ฟ ์•„๋‹ˆ๋ผ, ์ง€๋‚œ 1๋…„๊ฐ„ ์ˆ˜๋ฐฑ ๋ช…์˜ Web Scraper๊ฐ€ ๋งค์ผ ์‚ฌ์šฉํ•ด ์™”์Šต๋‹ˆ๋‹ค.
### ๊ฐœ๋ฐœ์ž/Web Scraper ์นœํ™”์  ๊ฒฝํ—˜
- ๐ŸŽฏ **์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ Web Scraping Shell**: Scrapling ํ†ตํ•ฉ, ๋‹จ์ถ•ํ‚ค, curl ์š”์ฒญ์„ Scrapling ์š”์ฒญ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๊ฑฐ๋‚˜ ๋ธŒ๋ผ์šฐ์ €์—์„œ ์š”์ฒญ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๋Š” ๋“ฑ์˜ ๋„๊ตฌ๋ฅผ ๊ฐ–์ถ˜ ์„ ํƒ์  ๋‚ด์žฅ IPython Shell๋กœ, Web Scraping ์Šคํฌ๋ฆฝํŠธ ๊ฐœ๋ฐœ์„ ๊ฐ€์†ํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿš€ **ํ„ฐ๋ฏธ๋„์—์„œ ๋ฐ”๋กœ ์‚ฌ์šฉ**: ์ฝ”๋“œ ํ•œ ์ค„ ์—†์ด Scrapling์œผ๋กœ URL์„ ์Šคํฌ๋ ˆ์ดํ•‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!
- ๐Ÿ› ๏ธ **ํ’๋ถ€ํ•œ ๋‚ด๋น„๊ฒŒ์ด์…˜ API**: ๋ถ€๋ชจ, ํ˜•์ œ, ์ž์‹ ํƒ์ƒ‰ ๋ฉ”์„œ๋“œ๋ฅผ ํ†ตํ•œ ๊ณ ๊ธ‰ DOM ์ˆœํšŒ๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿงฌ **ํ–ฅ์ƒ๋œ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ**: ๋‚ด์žฅ ์ •๊ทœ์‹, ํด๋ฆฌ๋‹ ๋ฉ”์„œ๋“œ, ์ตœ์ ํ™”๋œ ๋ฌธ์ž์—ด ์—ฐ์‚ฐ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿ“ **์ž๋™ ์…€๋ ‰ํ„ฐ ์ƒ์„ฑ**: ๋ชจ๋“  ์š”์†Œ์— ๋Œ€ํ•ด ๊ฒฌ๊ณ ํ•œ CSS/XPath selector๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
- ๐Ÿ”Œ **์ต์ˆ™ํ•œ API**: Scrapy/Parsel์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผํ•œ ์˜์‚ฌ ์š”์†Œ(pseudo-element)๋ฅผ ๊ฐ€์ง„ Scrapy/BeautifulSoup ์Šคํƒ€์ผ์˜ API.
- ๐Ÿ“˜ **์™„์ „ํ•œ ํƒ€์ž… ์ปค๋ฒ„๋ฆฌ์ง€**: ๋›ฐ์–ด๋‚œ IDE ์ง€์›๊ณผ ์ฝ”๋“œ ์ž๋™์™„์„ฑ์„ ์œ„ํ•œ ์™„์ „ํ•œ ํƒ€์ž… ํžŒํŠธ. ์ฝ”๋“œ๋ฒ ์ด์Šค ์ „์ฒด๊ฐ€ ๋ณ€๊ฒฝ๋  ๋•Œ๋งˆ๋‹ค **PyRight**์™€ **MyPy**๋กœ ์ž๋™ ๊ฒ€์‚ฌ๋ฉ๋‹ˆ๋‹ค.
- ๐Ÿ”‹ **๋ฐ”๋กœ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ Docker ์ด๋ฏธ์ง€**: ๋งค ๋ฆด๋ฆฌ์Šค๋งˆ๋‹ค ๋ชจ๋“  ๋ธŒ๋ผ์šฐ์ €๋ฅผ ํฌํ•จํ•œ Docker ์ด๋ฏธ์ง€๊ฐ€ ์ž๋™์œผ๋กœ ๋นŒ๋“œ ๋ฐ ํ‘ธ์‹œ๋ฉ๋‹ˆ๋‹ค.
## ์‹œ์ž‘ํ•˜๊ธฐ
๊นŠ์ด ๋“ค์–ด๊ฐ€์ง€ ์•Š๊ณ , Scrapling์ด ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ๋“ค์„ ๊ฐ„๋‹จํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
### ๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•
์„ธ์…˜์„ ์ง€์›ํ•˜๋Š” HTTP ์š”์ฒญ
```python
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use the latest Chrome TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
```
๊ณ ๊ธ‰ ์Šคํ…”์Šค ๋ชจ๋“œ
```python
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use the one-off request style; it opens the browser for this request, then closes it afterward
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
```
์™„์ „ํ•œ ๋ธŒ๋ผ์šฐ์ € ์ž๋™ํ™”
```python
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selectors work too, if you prefer

# Or use the one-off request style; it opens the browser for this request, then closes it afterward
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
```
### Spider
๋™์‹œ ์š”์ฒญ, ์—ฌ๋Ÿฌ ์„ธ์…˜ ํƒ€์ž…, ์ผ์‹œ์ •์ง€ & ์žฌ๊ฐœ๋ฅผ ๊ฐ–์ถ˜ ๋ณธ๊ฒฉ์ ์ธ ํฌ๋กค๋Ÿฌ ๊ตฌ์ถ•:
```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
ํ•˜๋‚˜์˜ Spider์—์„œ ์—ฌ๋Ÿฌ ์„ธ์…˜ ํƒ€์ž… ์‚ฌ์šฉ:
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # Explicit callback
```
์ฒดํฌํฌ์ธํŠธ๋ฅผ ์‚ฌ์šฉํ•ด ์žฅ์‹œ๊ฐ„ ํฌ๋กค๋ง์„ ์ผ์‹œ์ •์ง€ & ์žฌ๊ฐœ:
```python
QuotesSpider(crawldir="./crawl_data").start()
```
Ctrl+C๋ฅผ ๋ˆ„๋ฅด๋ฉด ์ •์ƒ์ ์œผ๋กœ ์ผ์‹œ์ •์ง€๋˜๊ณ , ์ง„ํ–‰ ์ƒํ™ฉ์ด ์ž๋™ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค. ์ดํ›„ Spider๋ฅผ ๋‹ค์‹œ ์‹œ์ž‘ํ•  ๋•Œ ๋™์ผํ•œ `crawldir`์„ ์ „๋‹ฌํ•˜๋ฉด ์ค‘๋‹จ๋œ ์ง€์ ๋ถ€ํ„ฐ ์žฌ๊ฐœํ•ฉ๋‹ˆ๋‹ค.
### ๊ณ ๊ธ‰ ํŒŒ์‹ฑ & ๋‚ด๋น„๊ฒŒ์ด์…˜
```python
from scrapling.fetchers import Fetcher
# ํ’๋ถ€ํ•œ ์š”์†Œ ์„ ํƒ๊ณผ ๋‚ด๋น„๊ฒŒ์ด์…˜
page = Fetcher.get('https://quotes.toscrape.com/')
# ์—ฌ๋Ÿฌ ์„ ํƒ ๋ฉ”์„œ๋“œ๋กœ ์ธ์šฉ๊ตฌ ๊ฐ€์ ธ์˜ค๊ธฐ
quotes = page.css('.quote') # CSS selector
quotes = page.xpath('//div[@class="quote"]') # XPath
quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup ์Šคํƒ€์ผ
# ์•„๋ž˜์™€ ๋™์ผ
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # ๋“ฑ๋“ฑ...
# ํ…์ŠคํŠธ ๋‚ด์šฉ์œผ๋กœ ์š”์†Œ ์ฐพ๊ธฐ
quotes = page.find_by_text('quote', tag='div')
# ๊ณ ๊ธ‰ ๋‚ด๋น„๊ฒŒ์ด์…˜
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall() # ์ฒด์ด๋‹ ์…€๋ ‰ํ„ฐ
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent
# ์š”์†Œ ๊ด€๊ณ„์™€ ์œ ์‚ฌ๋„
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```
์›น์‚ฌ์ดํŠธ๋ฅผ ๊ฐ€์ ธ์˜ค์ง€ ์•Š๊ณ  ํŒŒ์„œ๋ฅผ ๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:
```python
from scrapling.parser import Selector
page = Selector("<html>...</html>")
```
์‚ฌ์šฉ๋ฒ•์€ ์™„์ „ํžˆ ๋™์ผํ•ฉ๋‹ˆ๋‹ค!
### ๋น„๋™๊ธฐ ์„ธ์…˜ ๊ด€๋ฆฌ ์˜ˆ์‹œ
```python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Using async sessions
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']
    for url in urls:
        task = session.fetch(url)
        tasks.append(task)
    print(session.get_pool_stats())  # Optional - status of the browser tab pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
```
## CLI & Interactive Shell
Scrapling ships with a powerful command-line interface:
[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ Web Scraping Shell ์‹คํ–‰
```bash
scrapling shell
```
ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—†์ด ํŽ˜์ด์ง€๋ฅผ ํŒŒ์ผ๋กœ ๋ฐ”๋กœ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค (๊ธฐ๋ณธ์ ์œผ๋กœ `body` ํƒœ๊ทธ ๋‚ด๋ถ€์˜ ์ฝ˜ํ…์ธ ๋ฅผ ์ถ”์ถœ). ์ถœ๋ ฅ ํŒŒ์ผ์ด `.txt`๋กœ ๋๋‚˜๋ฉด ๋Œ€์ƒ์˜ ํ…์ŠคํŠธ ์ฝ˜ํ…์ธ ๊ฐ€ ์ถ”์ถœ๋ฉ๋‹ˆ๋‹ค. `.md`๋กœ ๋๋‚˜๋ฉด HTML ์ฝ˜ํ…์ธ ์˜ Markdown ํ‘œํ˜„์ด ๋ฉ๋‹ˆ๋‹ค. `.html`๋กœ ๋๋‚˜๋ฉด HTML ์ฝ˜ํ…์ธ  ์ž์ฒด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
```bash
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
```
> [!NOTE]
> There are more features, such as the MCP server and the interactive Web Scraping shell, but we keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/).
## ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํฌ
Scrapling์€ ๊ฐ•๋ ฅํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ดˆ๊ณ ์†์ž…๋‹ˆ๋‹ค. ์•„๋ž˜ ๋ฒค์น˜๋งˆํฌ๋Š” Scrapling์˜ ํŒŒ์„œ๋ฅผ ๋‹ค๋ฅธ ์ธ๊ธฐ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ์ตœ์‹  ๋ฒ„์ „๊ณผ ๋น„๊ตํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
### ํ…์ŠคํŠธ ์ถ”์ถœ ์†๋„ ํ…Œ์ŠคํŠธ (5000๊ฐœ ์ค‘์ฒฉ ์š”์†Œ)
| # | Library | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01x |
| 3 | Raw Lxml | 2.54 | 1.257x |
| 4 | PyQuery | 24.17 | ~12x |
| 5 | Selectolax | 82.63 | ~41x |
| 6 | MechanicalSoup | 1549.71 | ~767.1x |
| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
### ์š”์†Œ ์œ ์‚ฌ๋„ & ํ…์ŠคํŠธ ๊ฒ€์ƒ‰ ์„ฑ๋Šฅ
Scrapling์˜ ์ ์‘ํ˜• ์š”์†Œ ์ฐพ๊ธฐ ๊ธฐ๋Šฅ์€ ๋Œ€์•ˆ๋“ค์„ ํฌ๊ฒŒ ์•ž์„ญ๋‹ˆ๋‹ค:
| Library | Time (ms) | vs Scrapling |
|-------------|:---------:|:------------:|
| Scrapling | 2.39 | 1.0x |
| AutoScraper | 12.45 | 5.209x |
> ๋ชจ๋“  ๋ฒค์น˜๋งˆํฌ๋Š” 100ํšŒ ์ด์ƒ ์‹คํ–‰์˜ ํ‰๊ท ์ž…๋‹ˆ๋‹ค. ์ธก์ • ๋ฐฉ๋ฒ•์€ [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py)๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.
## ์„ค์น˜
Scrapling์€ Python 3.10 ์ด์ƒ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:
```bash
pip install scrapling
```
์ด ์„ค์น˜์—๋Š” ํŒŒ์„œ ์—”์ง„๊ณผ ์˜์กด์„ฑ๋งŒ ํฌํ•จ๋˜๋ฉฐ, Fetcher๋‚˜ ์ปค๋งจ๋“œ๋ผ์ธ ์˜์กด์„ฑ์€ ํฌํ•จ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
### ์„ ํƒ์  ์˜์กด์„ฑ
1. ์•„๋ž˜์˜ ์ถ”๊ฐ€ ๊ธฐ๋Šฅ, Fetcher, ๋˜๋Š” ๊ด€๋ จ ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜๋ ค๋ฉด Fetcher ์˜์กด์„ฑ๊ณผ ๋ธŒ๋ผ์šฐ์ € ์˜์กด์„ฑ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:
```bash
pip install "scrapling[fetchers]"
scrapling install # ์ผ๋ฐ˜ ์„ค์น˜
scrapling install --force # ๊ฐ•์ œ ์žฌ์„ค์น˜
```
์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ชจ๋“  ๋ธŒ๋ผ์šฐ์ €์™€ ์‹œ์Šคํ…œ ์˜์กด์„ฑ, fingerprint ์กฐ์ž‘ ์˜์กด์„ฑ์ด ๋‹ค์šด๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.
๋˜๋Š” ๋ช…๋ น์–ด ๋Œ€์‹  ์ฝ”๋“œ์—์„œ ์„ค์น˜ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:
```python
from scrapling.cli import install
install([], standalone_mode=False)  # Default installation
install(["--force"], standalone_mode=False)  # Force reinstall
```
2. ์ถ”๊ฐ€ ๊ธฐ๋Šฅ:
- MCP ์„œ๋ฒ„ ๊ธฐ๋Šฅ ์„ค์น˜:
```bash
pip install "scrapling[ai]"
```
- Shell ๊ธฐ๋Šฅ (Web Scraping Shell ๋ฐ `extract` ๋ช…๋ น์–ด) ์„ค์น˜:
```bash
pip install "scrapling[shell]"
```
- ๋ชจ๋“  ๊ธฐ๋Šฅ ์„ค์น˜:
```bash
pip install "scrapling[all]"
```
์œ„ ์ถ”๊ฐ€ ๊ธฐ๋Šฅ์„ ์„ค์น˜ํ•œ ํ›„์—๋„ (์•„์ง ํ•˜์ง€ ์•Š์•˜๋‹ค๋ฉด) `scrapling install`๋กœ ๋ธŒ๋ผ์šฐ์ € ์˜์กด์„ฑ์„ ์„ค์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
### Docker
DockerHub์—์„œ ๋ชจ๋“  ์ถ”๊ฐ€ ๊ธฐ๋Šฅ๊ณผ ๋ธŒ๋ผ์šฐ์ €๊ฐ€ ํฌํ•จ๋œ Docker ์ด๋ฏธ์ง€๋ฅผ ์„ค์น˜ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:
```bash
docker pull pyd4vinci/scrapling
```
๋˜๋Š” GitHub ๋ ˆ์ง€์ŠคํŠธ๋ฆฌ์—์„œ ๋‹ค์šด๋กœ๋“œ:
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
์ด ์ด๋ฏธ์ง€๋Š” GitHub Actions์™€ ๋ ˆํฌ์ง€ํ† ๋ฆฌ์˜ main ๋ธŒ๋žœ์น˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ž๋™์œผ๋กœ ๋นŒ๋“œ ๋ฐ ํ‘ธ์‹œ๋ฉ๋‹ˆ๋‹ค.
## ๊ธฐ์—ฌํ•˜๊ธฐ
๊ธฐ์—ฌ๋ฅผ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— [๊ธฐ์—ฌ ๊ฐ€์ด๋“œ๋ผ์ธ](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md)์„ ์ฝ์–ด์ฃผ์„ธ์š”.
## ๋ฉด์ฑ… ์กฐํ•ญ
> [!CAUTION]
> ์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ๊ต์œก ๋ฐ ์—ฐ๊ตฌ ๋ชฉ์ ์œผ๋กœ๋งŒ ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค. ์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ, ๊ตญ๋‚ด์™ธ ๋ฐ์ดํ„ฐ ์Šคํฌ๋ ˆ์ดํ•‘ ๋ฐ ๊ฐœ์ธ์ •๋ณด ๋ณดํ˜ธ ๊ด€๋ จ ๋ฒ•๋ฅ ์„ ์ค€์ˆ˜ํ•˜๋Š” ๋ฐ ๋™์˜ํ•œ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผ๋ฉ๋‹ˆ๋‹ค. ์ €์ž์™€ ๊ธฐ์—ฌ์ž๋Š” ์ด ์†Œํ”„ํŠธ์›จ์–ด์˜ ์˜ค์šฉ์— ๋Œ€ํ•ด ์ฑ…์ž„์ง€์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•ญ์ƒ ์›น์‚ฌ์ดํŠธ์˜ ์ด์šฉ์•ฝ๊ด€๊ณผ robots.txt ํŒŒ์ผ์„ ์กด์ค‘ํ•˜์„ธ์š”.
## ๐ŸŽ“ ์ธ์šฉ
์—ฐ๊ตฌ ๋ชฉ์ ์œผ๋กœ ์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์…จ๋‹ค๋ฉด, ์•„๋ž˜ ์ฐธ๊ณ  ๋ฌธํ—Œ์œผ๋กœ ์ธ์šฉํ•ด ์ฃผ์„ธ์š”:
```text
@misc{scrapling,
author = {Karim Shoair},
title = {Scrapling},
year = {2024},
url = {https://github.com/D4Vinci/Scrapling},
note = {An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!}
}
```
## ๋ผ์ด์„ ์Šค
์ด ํ”„๋กœ์ ํŠธ๋Š” BSD-3-Clause ๋ผ์ด์„ ์Šค ํ•˜์— ๋ฐฐํฌ๋ฉ๋‹ˆ๋‹ค.
## ๊ฐ์‚ฌ์˜ ๋ง
์ด ํ”„๋กœ์ ํŠธ์—๋Š” ๋‹ค์Œ์—์„œ ์ฐจ์šฉํ•œ ์ฝ”๋“œ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:
- Parsel (BSD ๋ผ์ด์„ ์Šค) โ€” [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) ์„œ๋ธŒ๋ชจ๋“ˆ์— ์‚ฌ์šฉ
---
<div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>