# Getting started

## Introduction

!!! success "Prerequisites"

    1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand the different fetcher types and when to use each one.
    2. You've completed or read the [Main classes](../parsing/main_classes.md) page to understand the [Selector](../parsing/main_classes.md#selector) and [Response](../fetching/choosing.md#response-object) classes.
    3. You've read the [Architecture](architecture.md) page for a high-level overview of how the spider system works.

The spider system lets you build concurrent, multi-page crawlers in just a few lines of code. If you've used Scrapy before, the patterns will feel familiar. If not, this guide will walk you through everything you need to get started.

## Your First Spider

A spider is a class that defines how to crawl and extract data from websites. Here's the simplest possible spider:

```python
from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    async def parse(self, response: Response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(""),
                "author": quote.css("small.author::text").get(""),
            }
```

Every spider needs three things:

1. `name`: a unique identifier for the spider.
2. `start_urls`: a list of URLs to start crawling from.
3. `parse()`: an async generator method that processes each response and yields results.

The `parse()` method is where the magic happens. You use the same selection methods you'd use with Scrapling's `Selector`/`Response`, and yield dictionaries to output scraped items.

## Running the Spider

To run your spider, create an instance and call `start()`:

```python
result = QuotesSpider().start()
```

The `start()` method handles all the async machinery internally, so you don't have to manage an event loop yourself. While the spider is running, everything that happens is logged to the terminal, and at the end of the crawl you get detailed stats.

Those stats are part of the returned `CrawlResult` object, which also holds the scraped items:

```python
result = QuotesSpider().start()

# Access scraped items
for item in result.items:
    print(item["text"], "-", item["author"])

# Check statistics
print(f"Scraped {result.stats.items_scraped} items")
print(f"Made {result.stats.requests_count} requests")
print(f"Took {result.stats.elapsed_seconds:.1f} seconds")

# Did the crawl finish or was it paused?
print(f"Completed: {result.completed}")
```

## Following Links

Most crawls need to follow links across multiple pages. Use `response.follow()` to create follow-up requests:

```python
from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    async def parse(self, response: Response):
        # Extract items from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(""),
                "author": quote.css("small.author::text").get(""),
            }

        # Follow the "next page" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

`response.follow()` handles relative URLs automatically by joining them with the current page's URL. It also sets the current page as the `Referer` header by default.
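
To see what that joining means in practice, here is a minimal sketch using the standard library's `urljoin`; Scrapling's internal resolution may differ in its details, but relative links resolve against the current page in the same way:

```python
# Illustration only: standard-library URL joining, mirroring how
# relative links are resolved against the current page's URL.
from urllib.parse import urljoin

page_url = "https://quotes.toscrape.com/page/1/"
print(urljoin(page_url, "/page/2/"))    # https://quotes.toscrape.com/page/2/
print(urljoin(page_url, "tag/humor/"))  # https://quotes.toscrape.com/page/1/tag/humor/
```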

You can point follow-up requests at different callback methods for different page types:

```python
async def parse(self, response: Response):
    for link in response.css("a.product-link::attr(href)").getall():
        yield response.follow(link, callback=self.parse_product)

async def parse_product(self, response: Response):
    yield {
        "name": response.css("h1::text").get(""),
        "price": response.css(".price::text").get(""),
    }
```

!!! note

    All callback methods must be async generators (using `async def` and `yield`).
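
For example, even a detail-page callback that produces a single item still has to `yield` it. The callback name and CSS selector below are purely illustrative:

```python
# Hypothetical detail-page callback: still an async generator,
# even though it only ever yields one item.
async def parse_author(self, response: Response):
    yield {"name": response.css("h3.author-title::text").get("")}
```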

## Exporting Data

The `ItemList` returned in `result.items` has built-in export methods:

```python
result = QuotesSpider().start()

# Export as JSON
result.items.to_json("quotes.json")

# Export as JSON with pretty-printing
result.items.to_json("quotes.json", indent=True)

# Export as JSON Lines (one JSON object per line)
result.items.to_jsonl("quotes.jsonl")
```

Both methods create parent directories automatically if they don't exist.
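
So, for instance, exporting to a nested path that doesn't exist yet works without any extra setup (the directory names below are arbitrary):

```python
# The exports/quotes/ directories are created if they are missing.
result.items.to_jsonl("exports/quotes/run-01.jsonl")
```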

## Filtering Domains

Use `allowed_domains` to restrict the spider to specific domains. This prevents it from accidentally following links to external websites:

```python
class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    allowed_domains = {"example.com"}

    async def parse(self, response: Response):
        for link in response.css("a::attr(href)").getall():
            # Links to other domains are silently dropped
            yield response.follow(link, callback=self.parse)
```

Subdomains are matched automatically: setting `allowed_domains = {"example.com"}` also allows `sub.example.com`, `blog.example.com`, and so on.

When a request is filtered out, it's counted in `stats.offsite_requests_count` so you can see how many were dropped.
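
A quick way to check this after a crawl, reusing the `start()` call and stats object shown earlier:

```python
result = MySpider().start()

# Requests dropped by the allowed_domains filter
print(f"Offsite requests dropped: {result.stats.offsite_requests_count}")
```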

## What's Next

Now that you have the basics, you can explore:

- Requests & Responses: learn about request priority, deduplication, metadata, and more.
- Sessions: use multiple fetcher types (HTTP, browser, stealth) in a single spider.
- Proxy management & blocking: rotate proxies across requests and handle blocking inside the spider.
- Advanced features: concurrency control, pause/resume, streaming, lifecycle hooks, and logging.