<!-- mcp-name: io.github.D4Vinci/Scrapling -->

<h1 align="center">
<a href="https://scrapling.readthedocs.io">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
<img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
</picture>
</a>
<br>
<small>Effortless Web Scraping for the Modern Web</small>
</h1>

<p align="center">
<a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
<img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
<a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
<img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
<a href="https://clickpy.clickhouse.com/dashboard/scrapling" rel="nofollow"><img src="https://img.shields.io/pypi/dm/scrapling" alt="PyPI package downloads"></a>
<a href="https://github.com/D4Vinci/Scrapling/tree/main/agent-skill" alt="AI Agent Skill directory">
<img alt="Static Badge" src="https://img.shields.io/badge/Skill-black?style=flat&label=Agent&link=https%3A%2F%2Fgithub.com%2FD4Vinci%2FScrapling%2Ftree%2Fmain%2Fagent-skill"></a>
<a href="https://clawhub.ai/D4Vinci/scrapling-official" alt="OpenClaw Skill">
<img alt="OpenClaw Skill" src="https://img.shields.io/badge/Clawhub-darkred?style=flat&label=OpenClaw&link=https%3A%2F%2Fclawhub.ai%2FD4Vinci%2Fscrapling-official"></a>
<br/>
<a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
<img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
</a>
<a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
<img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
</a>
<br/>
<a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
<img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
</p>

<p align="center">
<a href="https://scrapling.readthedocs.io/en/latest/parsing/selection.html"><strong>Selection Methods</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing.html"><strong>Choosing a Fetcher</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spider</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy Rotation</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/cli/overview.html"><strong>CLI</strong></a>
·
<a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html"><strong>MCP Server</strong></a>
</p>

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

The parser learns from website changes and automatically relocates elements when pages are updated. The fetchers bypass anti-bot systems such as Cloudflare Turnstile with no extra configuration. With the Spider framework, you can scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation, all in a few lines of Python. One library, performance without compromise.

Blazing-fast crawling with live stats and streaming. Built by Web Scrapers, designed for Web Scrapers and casual users alike.

```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website while evading detection!
products = p.css('.product', auto_save=True)  # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)  # If the website structure changes later, pass `adaptive=True` to find the elements again!
```
Or scale up to a full crawl
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
```

<p align="center">
<a href="https://dataimpulse.com/?utm_source=scrapling&utm_medium=banner&utm_campaign=scrapling" target="_blank" style="display:flex; justify-content:center; padding:4px 0;">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/DataImpulse.png" alt="At DataImpulse, we specialize in developing custom proxy services for your business. Make requests from anywhere, collect data, and enjoy fast connections with our premium proxies." style="max-height:60px;">
</a>
</p>

# Platinum Sponsors
<table>
<tr>
<td width="200">
<a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png">
</a>
</td>
<td> Scrapling handles Cloudflare Turnstile. If you need enterprise-grade protection coverage, <a href="https://hypersolutions.co?utm_source=github&utm_medium=readme&utm_campaign=scrapling">
<b>Hyper Solutions</b>
</a> provides API endpoints that generate valid anti-bot tokens for <b>Akamai</b>, <b>DataDome</b>, <b>Kasada</b>, and <b>Incapsula</b>. Just a simple API call, no browser automation needed. </td>
</tr>
<tr>
<td width="200">
<a href="https://birdproxies.com/t/scrapling" target="_blank" title="At Bird Proxies, we eliminate your pains such as banned IPs, geo restriction, and high costs so you can focus on your work.">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/BirdProxies.jpg">
</a>
</td>
<td>We built <a href="https://birdproxies.com/t/scrapling">
<b>BirdProxies</b>
</a> on the idea that proxies don't have to be complicated or expensive. <br /> Fast residential and ISP proxies in 195+ regions, fair pricing, and support that actually helps. <br />
<b>Play the FlappyBird game on the landing page and get free data!</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png">
</a>
</td>
<td>
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling">
<b>Evomi</b>
</a>: Residential proxies from $0.49/GB. Fully masked Chromium scraping browsers, residential IPs, automatic CAPTCHA solving, and anti-bot bypass.</br>
<b>Get results without the hassle through the Scraper API. MCP and N8N integrations supported.</b>
</td>
</tr>
<tr>
<td width="200">
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank" title="Unlock the Power of Social Media Data & AI">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TikHub.jpg">
</a>
</td>
<td>
<a href="https://tikhub.io/?ref=KarimShoair" target="_blank">TikHub.io</a> offers 900+ stable APIs across 16+ platforms including TikTok, X, YouTube, and Instagram, backed by 40M+ datasets. <br /> <a href="https://ai.tikhub.io/?ref=KarimShoair" target="_blank">Discounted AI models</a> are also available: up to 71% off Claude, GPT, GEMINI, and more.
</td>
</tr>
<tr>
<td width="200">
<a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank" title="Scalable Web Data Access for AI Applications">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/nsocks.png">
</a>
</td>
<td>
<a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank">Nsocks</a> provides fast residential and ISP proxies for developers and scrapers: global IP coverage, high anonymity, smart rotation, and reliable performance for automation and data extraction. Simplify large-scale web crawling with <a href="https://www.xcrawl.com/?keyword=2p67aivg" target="_blank">Xcrawl</a>.
</td>
</tr>
<tr>
<td width="200">
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting.">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png">
</a>
</td>
<td>
Close your laptop. Your scrapers keep running. <br />
<a href="https://petrosky.io/d4vinci" target="_blank">PetroSky VPS</a> - cloud servers for non-stop automation. Full control over Windows and Linux machines. From €6.99/month.
</td>
</tr>
<tr>
<td width="200">
<a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank" title="The #1 newsletter dedicated to Web Scraping">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TWSC.png">
</a>
</td>
<td>
Read <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank">the full review of Scrapling on The Web Scraping Club</a> (November 2025), the #1 newsletter dedicated to Web Scraping.
</td>
</tr>
<tr>
<td width="200">
<a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank" title="Proxy-Seller provides reliable proxy infrastructure for Web Scraping">
<img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxySeller.png">
</a>
</td>
<td>
<a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank">Proxy-Seller</a> provides reliable proxy infrastructure for Web Scraping: IPv4, IPv6, ISP, residential, and mobile proxies with stable performance, wide geographic coverage, and flexible pricing for enterprise-scale data collection.
</td>
</tr>
</table>

<i><sub>Want to place your ad here? [Click here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646)</sub></i>
# Sponsors

<!-- sponsors -->

<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
<a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
<a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
<a href="https://www.webshare.io/?referral_code=48r2m2cd5uz1" target="_blank" title="The Most Reliable Proxy with Unparalleled Performance"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/webshare.png"></a>
<a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>

<!-- /sponsors -->

<i><sub>Want to place your ad here? [Click here](https://github.com/sponsors/D4Vinci) and choose the tier you want!</sub></i>

---

## Key Features

### Spider: a full crawling framework
- 🕷️ **Scrapy-style Spider API**: Define Spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
- ⚡ **Concurrent crawling**: Configurable concurrent-request limits, per-domain throttling, and download delays.
- 🔀 **Multi-session support**: Mix HTTP requests and stealth headless browsers behind one interface, routing requests to different sessions by ID.
- 💾 **Pause & resume**: Checkpoint-based crawl persistence. Stop gracefully with Ctrl+C and pick up where you left off on restart.
- 📡 **Streaming mode**: Receive scraped items as a stream with live stats via `async for item in spider.stream()`, which suits UIs, pipelines, and long-running crawls.
- 🛡️ **Blocked-request detection**: Automatic detection and retrying of blocked requests through custom logic.
- 📦 **Built-in exporting**: Export results through your own item pipelines or the built-in JSON/JSONL exporters, via `result.items.to_json()` / `result.items.to_jsonl()` respectively.

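To make the two export formats concrete: JSON Lines stores one JSON object per line, while plain JSON stores a single array. A minimal standard-library sketch of the difference (illustrative only; Scrapling's exporters produce these files for you):

```python
import json

# Hypothetical items, as a Spider's parse() callback might yield them
items = [{"title": "Widget"}, {"title": "Gadget"}]

# JSONL: one JSON object per line -- easy to append to and stream
jsonl = "\n".join(json.dumps(item) for item in items)

# JSON: a single array -- the whole result set in one document
as_json = json.dumps(items)
```

The line-per-object layout is why JSONL is the friendlier format for long crawls: items can be written as they arrive without rewriting the whole file.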
### Advanced website fetching with session support
- **HTTP requests**: Fast, stealthy HTTP requests with the `Fetcher` class. It can mimic browsers' TLS fingerprints and headers, and can use HTTP/3.
- **Dynamic loading**: Fetch dynamic websites through full browser automation with the `DynamicFetcher` class, which supports Playwright's Chromium and Google Chrome.
- **Anti-bot bypass**: Advanced stealth through fingerprint masking with `StealthyFetcher`. It automatically bypasses all types of Cloudflare Turnstile/Interstitial with ease.
- **Session management**: Persistent sessions via the `FetcherSession`, `StealthySession`, and `DynamicSession` classes, keeping cookies and state across requests.
- **Proxy rotation**: A built-in `ProxyRotator` with rotating or custom strategies and per-request proxy overrides, working across all session types.
- **Domain blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
- **Async support**: Full async support across all fetchers and dedicated async session classes.

### Adaptive scraping & AI integration
- 🔄 **Smart element tracking**: Relocate elements after website changes using intelligent similarity algorithms.
- 🎯 **Flexible smart selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 **Find similar elements**: Automatically locate elements similar to one you have already found.
- 🤖 **An MCP server to use with AI**: A built-in MCP server for AI-assisted Web Scraping and data extraction. It ships with powerful custom capabilities that use Scrapling to extract the target content before passing it to the AI (Claude/Cursor, etc.), speeding up your work and cutting costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))

### High performance & battle-tested architecture
- 🚀 **Lightning fast**: Optimized performance that outpaces most Python scraping libraries.
- 🔋 **Memory efficient**: Optimized data structures and lazy loading keep the memory footprint minimal.
- ⚡ **Fast JSON serialization**: 10x faster than the standard library.
- 🏗️ **Battle tested**: Beyond 92% test coverage and full type-hint coverage, Scrapling has been used daily by hundreds of Web Scrapers over the past year.

### Developer/Web Scraper friendly experience
- 🎯 **Interactive Web Scraping Shell**: An optional built-in IPython shell with Scrapling integration, shortcuts, and tools such as converting curl commands to Scrapling requests and viewing request results in your browser, to speed up developing Web Scraping scripts.
- 🚀 **Use it directly from the terminal**: Scrape a URL with Scrapling without writing a single line of code!
- 🛠️ **Rich navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
- 🧬 **Enhanced text processing**: Built-in regex, cleaning methods, and optimized string operations.
- 📝 **Automatic selector generation**: Generate robust CSS/XPath selectors for any element.
- 🔌 **Familiar API**: A Scrapy/BeautifulSoup-style API with the same pseudo-elements you use in Scrapy/Parsel.
- 📘 **Complete type coverage**: Full type hints for excellent IDE support and code completion. The whole codebase is automatically checked with **PyRight** and **MyPy** on every change.
- 🐋 **Ready-to-use Docker images**: On every release, Docker images with all browsers included are built and pushed automatically.

## Getting Started

Without diving too deep, here's a quick look at what Scrapling can do.

### Basic usage
HTTP requests with session support
```python
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use the latest Chrome TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
```
Advanced stealth mode
```python
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keeps the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or the one-off request style; it opens the browser for this request, then closes it when done
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
```
Full browser automation
```python
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keeps the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selectors work too, if you prefer

# Or the one-off request style; it opens the browser for this request, then closes it when done
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
```

### Spider
Build full crawlers with concurrent requests, multiple session types, and pause & resume:
```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }

        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
Use multiple session types in one Spider:
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # Explicit callback
```
Pause & resume long crawls with checkpoints:
```python
QuotesSpider(crawldir="./crawl_data").start()
```
Press Ctrl+C to pause gracefully; progress is saved automatically. Start the Spider again with the same `crawldir` and it resumes from where it stopped.

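Conceptually, a checkpoint persists the pending queue and the set of already-visited URLs so a later run can pick up where the last one stopped. A plain-Python illustration of that idea (a concept sketch only, not Scrapling's actual on-disk format, which is internal to `crawldir`):

```python
import json
from pathlib import Path

def save_checkpoint(path: Path, frontier: list[str], seen: set[str]) -> None:
    # Persist the URLs still to crawl and the URLs already crawled
    path.write_text(json.dumps({"frontier": frontier, "seen": sorted(seen)}))

def load_checkpoint(path: Path, start_urls: list[str]) -> tuple[list[str], set[str]]:
    # Resume from saved state, or fall back to a fresh crawl
    if not path.exists():
        return list(start_urls), set()
    state = json.loads(path.read_text())
    return state["frontier"], set(state["seen"])
```

Because the state is written out between requests, an interrupted process loses at most the work in flight, which is what makes a graceful Ctrl+C cheap.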
### Advanced parsing & navigation
```python
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find elements by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```
You can also use the parser directly without fetching a website:
```python
from scrapling.parser import Selector

page = Selector("<html>...</html>")
```
And the usage is exactly the same!

### Async session management example
```python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Use an async session
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']

    for url in urls:
        task = session.fetch(url)
        tasks.append(task)

    print(session.get_pool_stats())  # Optional - status of the browser-tab pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
```

## CLI & Interactive Shell

Scrapling ships with a powerful command-line interface:

[](https://asciinema.org/a/736339)

Launch the interactive Web Scraping shell
```bash
scrapling shell
```
Extract pages straight to files without programming (by default, the content inside the `body` tag is extracted). If the output file ends with `.txt`, the target's text content is extracted; if it ends with `.md`, you get a Markdown representation of the HTML; if it ends with `.html`, the raw HTML content is saved as-is.
```bash
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
```

> [!NOTE]
> There's more, like the MCP server and the interactive Web Scraping shell, but we're keeping this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/).

## Performance Benchmarks

Scrapling isn't just powerful, it's also blazing fast. The benchmarks below compare Scrapling's parser against the latest versions of other popular libraries.

### Text extraction speed test (5000 nested elements)

| # | Library | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01x |
| 3 | Raw Lxml | 2.54 | 1.257x |
| 4 | PyQuery | 24.17 | ~12x |
| 5 | Selectolax | 82.63 | ~41x |
| 6 | MechanicalSoup | 1549.71 | ~767.1x |
| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |

### Element similarity & text search performance

Scrapling's adaptive element-finding capabilities far outpace the alternatives:

| Library | Time (ms) | vs Scrapling |
|-------------|:---------:|:------------:|
| Scrapling | 2.39 | 1.0x |
| AutoScraper | 12.45 | 5.209x |

> All benchmarks are the average of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for the methodology.

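For reference, averaging over many runs is straightforward with the standard library's `timeit`. A generic sketch of that methodology (the real harness lives in `benchmarks.py` and times the actual libraries; the workload below is a stand-in):

```python
import timeit

def parse_page() -> int:
    # Stand-in workload for a real parsing call (e.g., extracting text
    # from thousands of nested elements)
    html = "<span>text</span>" * 1000
    return html.count("<span>")

RUNS = 100
total_seconds = timeit.timeit(parse_page, number=RUNS)  # run it RUNS times
avg_ms = total_seconds / RUNS * 1000  # mean time per run, in milliseconds
```

Averaging this way smooths out one-off noise from the OS scheduler and caches, which is why single-run timings are not comparable across libraries.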
## Installation

Scrapling requires Python 3.10 or higher:

```bash
pip install scrapling
```

This installation includes only the parsing engine and its dependencies; no fetcher or command-line dependencies are included.

### Optional dependencies

1. To use any of the extras below, the fetchers, or their related classes, install the fetchers' dependencies and then the browser dependencies as follows:
   ```bash
   pip install "scrapling[fetchers]"

   scrapling install  # Regular installation
   scrapling install --force  # Forced reinstallation
   ```

   This downloads all browsers together with their system dependencies and the fingerprint-manipulation dependencies.

   You can also run the installation from code instead of the command line:
   ```python
   from scrapling.cli import install

   install([], standalone_mode=False)  # Regular installation
   install(["--force"], standalone_mode=False)  # Forced reinstallation
   ```

2. Extras:
   - Install the MCP server feature:
   ```bash
   pip install "scrapling[ai]"
   ```
   - Install the shell features (the Web Scraping shell and the `extract` command):
   ```bash
   pip install "scrapling[shell]"
   ```
   - Install everything:
   ```bash
   pip install "scrapling[all]"
   ```
   After installing any of these extras, install the browser dependencies with `scrapling install` (if you haven't done so already).

### Docker
You can also pull a Docker image with all extras and browsers included from Docker Hub:
```bash
docker pull pyd4vinci/scrapling
```
Or download it from the GitHub registry:
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
These images are built and pushed automatically through GitHub Actions from the repository's main branch.

## Contributing

Contributions are welcome! Please read the [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.

## Disclaimer

> [!CAUTION]
> This library is provided for educational and research purposes only. By using it, you agree to comply with local and international laws on data scraping and privacy. The author and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files.

## Citation
If you use this library in your research, please cite it as below:
```text
@misc{scrapling,
  author = {Karim Shoair},
  title = {Scrapling},
  year = {2024},
  url = {https://github.com/D4Vinci/Scrapling},
  note = {An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!}
}
```

## License

This project is licensed under the BSD-3-Clause license.

## Acknowledgments

This project includes code adapted from:
- Parsel (BSD license): used in the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule

---
<div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>
|