
## Pick Your Path

Not sure where to start? Pick the path that matches what you're trying to do:

| I want to... | Start here |
|---|---|
| Parse HTML I already have | Querying elements — CSS, XPath, and text-based selection |
| Quickly scrape a page and prototype | Pick a fetcher and test right away, or launch the interactive shell |
| Build a crawler that scales | Spiders — concurrent, multi-session crawls with pause/resume |
| Scrape without writing code | CLI extract commands, or hook up the MCP server to your favourite AI tool |
| Migrate from another library | The BeautifulSoup or Scrapy comparison |

We'll start with a quick review of the parsing capabilities. Then we'll fetch websites — making HTTP requests, launching browsers, and parsing the responses.

Here's an HTML document generated by ChatGPT that we will be using as an example throughout this page:

```html
<html>
  <head>
    <title>Complex Web Page</title>
    <style>
      .hidden { display: none; }
    </style>
  </head>
  <body>
    <header>
      <nav>
        <ul>
          <li> <a href="#home">Home</a> </li>
          <li> <a href="#about">About</a> </li>
          <li> <a href="#contact">Contact</a> </li>
        </ul>
      </nav>
    </header>
    <main>
      <section id="products" schema='{"jsonable": "data"}'>
        <h2>Products</h2>
        <div class="product-list">
          <article class="product" data-id="1">
            <h3>Product 1</h3>
            <p class="description">This is product 1</p>
            <span class="price">$10.99</span>
            <div class="hidden stock">In stock: 5</div>
          </article>

          <article class="product" data-id="2">
            <h3>Product 2</h3>
            <p class="description">This is product 2</p>
            <span class="price">$20.99</span>
            <div class="hidden stock">In stock: 3</div>
          </article>

          <article class="product" data-id="3">
            <h3>Product 3</h3>
            <p class="description">This is product 3</p>
            <span class="price">$15.99</span>
            <div class="hidden stock">Out of stock</div>
          </article>
        </div>
      </section>

      <section id="reviews">
        <h2>Customer Reviews</h2>
        <div class="review-list">
          <div class="review" data-rating="5">
            <p class="review-text">Great product!</p>
            <span class="reviewer">John Doe</span>
          </div>
          <div class="review" data-rating="4">
            <p class="review-text">Good value for money.</p>
            <span class="reviewer">Jane Smith</span>
          </div>
        </div>
      </section>
    </main>
    <script id="page-data" type="application/json">
      {
        "lastUpdated": "2024-09-22T10:30:00Z",
        "totalProducts": 3
      }
    </script>
  </body>
</html>
```

Start by loading the raw HTML above:

```python
from scrapling.parser import Selector

page = Selector(html_doc)  # `html_doc` is the HTML document above as a string
page  # <data='<html><head><title>Complex Web Page</tit...'>
```

Get all text content on the page recursively:

```python
page.get_all_text(ignore_tags=('script', 'style'))
# 'Complex Web Page\nHome\nAbout\nContact\nProducts\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock\nCustomer Reviews\nGreat product!\nJohn Doe\nGood value for money.\nJane Smith'
```

## Finding elements

If there's an element you want to find on the page, you will find it; your creativity is the only limitation!

Find the first `section` element:

```python
section_element = page.find('section')
# <data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>
```

Find all `section` elements:

```python
section_elements = page.find_all('section')
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>, <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
```

Find all `section` elements whose `id` attribute equals `products`:

```python
section_elements = page.find_all('section', {'id': "products"})
# Same as
section_elements = page.find_all('section', id="products")
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
```

Find all `section` elements whose `id` attribute contains `product`:

```python
section_elements = page.find_all('section', {'id*': "product"})
```

Find all `h3` elements whose text content matches the regex `Product \d`:

```python
import re

page.find_all('h3', re.compile(r'Product \d'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```

Find all `h3` and `h2` elements whose text content matches the regex `Product`:

```python
page.find_all(['h3', 'h2'], re.compile(r'Product'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```

Find all elements whose text content matches `Products` exactly (whitespace is ignored):

```python
page.find_by_text('Products', first_match=False)
# [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```

Or find all elements whose text content matches the regex `Product \d`:

```python
page.find_by_regex(r'Product \d', first_match=False)
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```

Find all elements that are similar to a target element:

```python
target_element = page.find_by_regex(r'Product \d', first_match=True)
# <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>
target_element.find_similar()
# [<data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```

Find the first element that matches a CSS selector:

```python
page.css('.product-list [data-id="1"]')[0]
# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```

Find all elements that match a CSS selector:

```python
page.css('.product-list article')
# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```

Find the first element that matches an XPath selector:

```python
page.xpath("//*[@id='products']/div/article")[0]
# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```

Find all elements that match an XPath selector:

```python
page.xpath("//*[@id='products']/div/article")
# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```

With this, we've just scratched the surface of these methods; more advanced options for them are shown later.

## Accessing elements' data

It's as simple as:

```python
>>> section_element.tag
'section'
>>> print(section_element.attrib)
{'id': 'products', 'schema': '{"jsonable": "data"}'}
>>> section_element.attrib['schema'].json()  # If an attribute value can be converted to json, then use `.json()` to convert it
{'jsonable': 'data'}
>>> section_element.text  # Direct text content
''
>>> section_element.get_all_text()  # All text content recursively
'Products\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
>>> section_element.html_content  # The HTML content of the element
'<section id="products" schema=\'{"jsonable": "data"}\'><h2>Products</h2>\n        <div class="product-list">\n          <article class="product" data-id="1"><h3>Product 1</h3>\n            <p class="description">This is product 1</p>\n            <span class="price">$10.99</span>\n            <div class="hidden stock">In stock: 5</div>\n          </article><article class="product" data-id="2"><h3>Product 2</h3>\n            <p class="description">This is product 2</p>\n            <span class="price">$20.99</span>\n            <div class="hidden stock">In stock: 3</div>\n          </article><article class="product" data-id="3"><h3>Product 3</h3>\n            <p class="description">This is product 3</p>\n            <span class="price">$15.99</span>\n            <div class="hidden stock">Out of stock</div>\n          </article></div>\n      </section>'
>>> print(section_element.prettify())  # The prettified version
'''
<section id="products" schema='{"jsonable": "data"}'><h2>Products</h2>
    <div class="product-list">
      <article class="product" data-id="1"><h3>Product 1</h3>
        <p class="description">This is product 1</p>
        <span class="price">$10.99</span>
        <div class="hidden stock">In stock: 5</div>
      </article><article class="product" data-id="2"><h3>Product 2</h3>
        <p class="description">This is product 2</p>
        <span class="price">$20.99</span>
        <div class="hidden stock">In stock: 3</div>
      </article><article class="product" data-id="3"><h3>Product 3</h3>
        <p class="description">This is product 3</p>
        <span class="price">$15.99</span>
        <div class="hidden stock">Out of stock</div>
      </article>
    </div>
</section>
'''
>>> section_element.path  # All the ancestors in the DOM tree of this element
[<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>,
 <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>,
 <data='<html><head><title>Complex Web Page</tit...'>]
>>> section_element.generate_css_selector
'#products'
>>> section_element.generate_full_css_selector
'body > main > #products > #products'
>>> section_element.generate_xpath_selector
"//*[@id='products']"
>>> section_element.generate_full_xpath_selector
"//body/main/*[@id='products']"
```

## Navigation

Using the elements we found above:

```python
>>> section_element.parent
<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>
>>> section_element.parent.tag
'main'
>>> section_element.parent.parent.tag
'body'
>>> section_element.children
[<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>,
 <data='<div class="product-list"> <article clas...' parent='<section id="products" schema='{"jsonabl...'>]
>>> section_element.siblings
[<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
>>> section_element.next  # gets the next element; the same logic applies to `section_element.previous`
<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>
>>> section_element.children.css('h2::text').getall()
['Products']
>>> page.css('[data-id="1"]')[0].has_class('product')
True
```

If you need more than the element's direct parent, you can iterate over the whole ancestor tree of any element, like below:

```python
for ancestor in section_element.iterancestors():
    print(ancestor.tag)  # do something with each ancestor
```

You can also search for a specific ancestor of an element that satisfies a condition: pass a function that takes a `Selector` object as an argument and returns `True` if the condition is satisfied, or `False` otherwise, like below:

```python
>>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
<data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
```

## Fetching websites

Instead of passing the raw HTML to Scrapling, you can retrieve a website's response directly via HTTP requests or by fetching it in a browser.

There's a fetcher for every use case.

### HTTP Requests

For simple HTTP requests, there's a `Fetcher` class that can be imported and used as below:

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://scrapling.requestcatcher.com/get', impersonate="chrome")
```

With that out of the way, here's how to do all HTTP methods:

```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> page = Fetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'})
>>> page = Fetcher.delete('https://scrapling.requestcatcher.com/delete')
```

For async requests, just switch the import to `AsyncFetcher`:

```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> page = await AsyncFetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'})
>>> page = await AsyncFetcher.delete('https://scrapling.requestcatcher.com/delete')
```

!!! note "Notes:"

    1. The `stealthy_headers` argument, enabled by default, generates real browser headers for each request, including a `Referer` header that makes the request look as if it came from a Google search for that domain.
    2. The `impersonate` argument lets you mimic the TLS fingerprint of a specific browser version.
    3. The `http3` argument, when enabled, makes the fetcher use HTTP/3 for requests, which makes your requests more authentic.

This is just the tip of the iceberg with this fetcher; check out the rest from here

### Dynamic loading

If you deal with dynamic websites, like most websites today, we have you covered!

The `DynamicFetcher` class (formerly `PlayWrightFetcher`) offers many options for fetching and loading web pages using Chromium-based browsers.

```python
>>> from scrapling.fetchers import DynamicFetcher
>>> page = DynamicFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>>> page.css("#search a::attr(href)").get()
'https://github.com/D4Vinci/Scrapling'
>>> # The async version of fetch
>>> page = await DynamicFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
>>> page.css("#search a::attr(href)").get()
'https://github.com/D4Vinci/Scrapling'
```

It's built on top of Playwright and currently provides two main run options that can be mixed as you want:

- Vanilla Playwright, without any modifications other than the ones you chose. It uses the Chromium browser.
- A real browser, like your installed Chrome: pass the `real_chrome` argument, or the CDP URL of a browser for the fetcher to control. Most of the options can be enabled with it.

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from here for all details and the complete list of arguments.

### Dynamic anti-protection loading

We also have you covered if you deal with dynamic websites guarded by annoying anti-bot protections!

The `StealthyFetcher` class uses a stealthy version of the `DynamicFetcher` explained above.

Some of the things it does:

  1. It easily bypasses all types of Cloudflare's Turnstile/Interstitial automatically.
  2. It bypasses CDP runtime leaks and WebRTC leaks.
  3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
  4. It generates canvas noise to prevent fingerprinting through canvas.
  5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
  6. It makes requests look as if they came from Google's search page of the requested website.
  7. And other anti-protection options...

```python
>>> from scrapling.fetchers import StealthyFetcher
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True)  # Solve Cloudflare captcha automatically if presented
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True)  # and the rest of arguments...
>>> # The async version of fetch
>>> page = await StealthyFetcher.async_fetch('https://www.browserscan.net/bot-detection')
>>> page.status == 200
True
```

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from here for all details and the complete list of arguments.


That's Scrapling at a glance. If you want to learn more, continue to the next section.