Scrapling / docs /parsing /adaptive.md

Karim shoair

docs: general correction across all the pages

4329d45 about 1 month ago


Adaptive scraping

!!! success "Prerequisites"

1. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector) object.
2. You've completed or read the [Main classes](../parsing/main_classes.md) page to understand the [Selector](../parsing/main_classes.md#selector) class.

Adaptive scraping (previously known as automatch) is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.

Let's say you are scraping a page with a structure like this:

<div class="container">
    <section class="products">
        <article class="product" id="p1">
            <h3>Product 1</h3>
            <p class="description">Description 1</p>
        </article>
        <article class="product" id="p2">
            <h3>Product 2</h3>
            <p class="description">Description 2</p>
        </article>
    </section>
</div>

And you want to scrape the first product, the one with the p1 ID. You will probably write a selector like this:

page.css('#p1')

When website owners implement structural changes like this:

<div class="new-container">
    <div class="product-wrapper">
        <section class="products">
            <article class="product new-class" data-id="p1">
                <div class="product-info">
                    <h3>Product 1</h3>
                    <p class="new-description">Description 1</p>
                </div>
            </article>
            <article class="product new-class" data-id="p2">
                <div class="product-info">
                    <h3>Product 2</h3>
                    <p class="new-description">Description 2</p>
                </div>
            </article>
        </section>
    </div>
</div>

The selector will no longer function, and your code needs maintenance. That's where Scrapling's adaptive feature comes into play.

With Scrapling, you can enable the adaptive feature the first time you select an element. The next time you select that element and it no longer exists, Scrapling uses the properties it saved earlier and searches the page for the element with the highest similarity to the original, all without AI :)

from scrapling import Selector, Fetcher
# Before the change
page = Selector(page_source, adaptive=True, url='example.com')
# or
Fetcher.adaptive = True
page = Fetcher.get('https://example.com')
# then
element = page.css('#p1', auto_save=True)
if not element:  # One day website changes?
    element = page.css('#p1', adaptive=True)  # Scrapling still finds it!
# the rest of your code...

Below, I will show you an example of how to use this feature. Then, we will dive deeper into how it works and how to use it. Note that it works with all selection methods, not just CSS/XPath selection.

Real-World Scenario

Let's use a real website as an example and use one of the fetchers to fetch its source. To achieve this, we need to identify a website that is about to update its design/structure, copy its source, and then wait for the website to change. Of course, that's nearly impossible to know unless I know the website's owner, and that would make it a staged test, haha.

To solve this issue, I will use The Web Archive's Wayback Machine. Here is a copy of Stack Overflow's website from 2010; pretty old, eh?
Let's see if the adaptive feature can extract the same button in the old design from 2010 and the current design using the same selector :)

If I want to extract the Questions button from the old design, I can use a selector like this: #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a. This selector is too specific because it was generated by Google Chrome.

Now, let's test the same selector in both versions:

>>> from scrapling import Fetcher
>>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>>> new_url = "https://stackoverflow.com/"
>>> Fetcher.configure(adaptive=True, adaptive_domain='stackoverflow.com')
>>>
>>> # Save the element's properties while the selector still works
>>> page = Fetcher.get(old_url, timeout=30)
>>> element1 = page.css(selector, auto_save=True)[0]
>>>
>>> # Same selector but used in the updated website
>>> page = Fetcher.get(new_url)
>>> element2 = page.css(selector, adaptive=True)[0]
>>>
>>> if element1.text == element2.text:
...     print('Scrapling found the same element in the old and new designs!')
...
Scrapling found the same element in the old and new designs!

Note that I introduced a new argument called adaptive_domain. This is because, for Scrapling, these are two different domains (archive.org and stackoverflow.com), so Scrapling will isolate their adaptive data. To inform Scrapling that they are the same website, we must pass the custom domain we wish to use while saving adaptive data for both, ensuring Scrapling doesn't isolate them.

The code will be the same in a real-world scenario, except it will use the same URL for both requests, so you won't need to use the adaptive_domain argument. This is the closest example I can give to real-world cases, so I hope it didn't confuse you :)

Note that across the two examples above, I used both the Selector and Fetcher classes to show that the adaptive logic is the same.

!!! info

The main reason for creating the `adaptive_domain` argument was to handle the case where a website changes its URL while changing its design/structure. In that case, you can use it to continue using the previously stored adaptive data for the new URL. Otherwise, Scrapling will consider it a new website and discard the old data.
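
To see why the two URLs in the Stack Overflow example end up in separate buckets, here is a minimal sketch using only the standard library. The `storage_domain` helper is hypothetical and only mimics the behavior described above; Scrapling's own logic may normalize the host differently (the text above refers to archive.org rather than web.archive.org).

```python
from urllib.parse import urlparse


def storage_domain(url, adaptive_domain=None):
    """Hypothetical helper: pick the key under which adaptive data is grouped."""
    # An explicit adaptive_domain wins; otherwise fall back to the URL's host.
    return adaptive_domain or urlparse(url).netloc


old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
new_url = "https://stackoverflow.com/"

# Without an override, the two pages are treated as different websites.
print(storage_domain(old_url))  # web.archive.org
print(storage_domain(new_url))  # stackoverflow.com

# With the override, both pages share the same adaptive data.
shared = "stackoverflow.com"
print(storage_domain(old_url, shared) == storage_domain(new_url, shared))  # True
```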

How the adaptive scraping feature works

Adaptive scraping works in two phases:

1. Save Phase: Store unique properties of elements
2. Match Phase: Find elements with similar properties later

Let's say you've selected an element through any method and want the library to find it the next time you scrape this website, even if it undergoes structural/design changes.

With as few technical details as possible, the general logic goes as follows:

1. You tell Scrapling to save that element's unique properties in one of the ways we will show below.
2. Scrapling uses its configured database (SQLite by default) and saves each element's unique properties.
3. Now, because everything about the element can be changed or removed by the website's owner(s), nothing from the element can be used as a unique identifier for the database. To solve this issue, I made the storage system rely on two things:
    1. The domain of the current website. If you are using the Selector class, pass it when initializing; if you are using a fetcher, the domain will be automatically taken from the URL.
    2. An identifier to query that element's properties from the database. You don't always have to set the identifier yourself; we'll discuss this later.

    Together, they will later be used to retrieve the element's unique properties from the database.
4. Later, when the website's structure changes, you tell Scrapling to find the element by enabling adaptive. Scrapling retrieves the element's unique properties and matches all elements on the page against the unique properties we already have for this element. A score is calculated based on their similarity to the desired element. In that comparison, everything is taken into consideration, as you will see later.
5. The element(s) with the highest similarity score to the wanted element are returned.
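
The storage side of these steps can be sketched with the standard library alone. This is only an illustration of keying saved properties by domain and identifier; the table layout and the `save`/`retrieve` helpers below are assumptions and not Scrapling's actual schema or API.

```python
import json
import sqlite3

# In-memory database with a hypothetical schema: one row per (domain, identifier).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE elements ("
    " domain TEXT NOT NULL,"
    " identifier TEXT NOT NULL,"
    " properties TEXT NOT NULL,"
    " PRIMARY KEY (domain, identifier))"
)


def save(domain, identifier, properties):
    """Save phase: store the element's properties under (domain, identifier)."""
    conn.execute(
        "INSERT OR REPLACE INTO elements VALUES (?, ?, ?)",
        (domain, identifier, json.dumps(properties)),
    )


def retrieve(domain, identifier):
    """Match phase, first step: load the saved properties, or None if absent."""
    row = conn.execute(
        "SELECT properties FROM elements WHERE domain = ? AND identifier = ?",
        (domain, identifier),
    ).fetchone()
    return json.loads(row[0]) if row else None


save("example.com", "#p1", {"tag": "article", "attributes": {"class": "product", "id": "p1"}})

print(retrieve("example.com", "#p1")["tag"])  # article
print(retrieve("other-site.com", "#p1"))      # None
```

Because nothing inside the element is used as the key, the lookup keeps working even after every property of the element has changed on the website.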

The unique properties

You might wonder what unique properties we are referring to when discussing the removal or alteration of all element properties.

For Scrapling, the unique properties we are relying on are:

Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
Element's parent tag name, attributes (names and values), and text.

But you need to understand that the comparison between elements isn't exact; it's more about how similar these values are. So everything is considered, even the values' order, like the order in which the element class names were written before and the order in which the same element class names are written now.
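
As a rough illustration of similarity-based comparison, the sketch below scores candidates with Python's `difflib.SequenceMatcher`. The `similarity` function, the equal weighting, and the choice of properties are assumptions made for this example; Scrapling's real scoring is more thorough. The data mirrors the product page from the start of this page.

```python
from difflib import SequenceMatcher


def _ratio(a, b):
    """Similarity of two sequences (strings or lists), from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def similarity(saved, candidate):
    """Hypothetical score: the average of several order-sensitive comparisons."""
    checks = [
        1.0 if saved["tag"] == candidate["tag"] else 0.0,
        _ratio(saved["text"], candidate["text"]),
        # Attribute names, in the order they were written
        _ratio(list(saved["attributes"]), list(candidate["attributes"])),
        # Attribute values, in the order they were written
        _ratio(" ".join(saved["attributes"].values()),
               " ".join(candidate["attributes"].values())),
        # Path of tag names from the root down to the element
        _ratio(saved["path"], candidate["path"]),
    ]
    return sum(checks) / len(checks)


# Properties saved from the old design for the '#p1' element
saved = {"tag": "article", "text": "", "attributes": {"class": "product", "id": "p1"},
         "path": ["div", "section", "article"]}

# Elements found in the new design
new_path = ["div", "div", "section", "article"]
candidates = {
    "new p1": {"tag": "article", "text": "", "path": new_path,
               "attributes": {"class": "product new-class", "data-id": "p1"}},
    "new p2": {"tag": "article", "text": "", "path": new_path,
               "attributes": {"class": "product new-class", "data-id": "p2"}},
    "wrapper": {"tag": "div", "text": "", "path": new_path + ["div"],
                "attributes": {"class": "product-info"}},
}

best = max(candidates, key=lambda name: similarity(saved, candidates[name]))
print(best)  # new p1
```

No single property matches exactly here, yet the first product still wins because its values are the closest overall.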

How to use the adaptive feature

The adaptive feature can be applied to any found element, and it's added as arguments to CSS/XPath Selection methods, as you saw above, but we will get back to that later.

First, you must enable the adaptive feature, either by passing adaptive=True to the Selector class when you initialize it or by enabling it in the fetcher you are using, as we will show.

Examples:

>>> from scrapling import Selector, Fetcher
>>> page = Selector(html_doc, adaptive=True)
# OR
>>> Fetcher.adaptive = True
>>> page = Fetcher.get('https://example.com')

If you are using the Selector class, you need to pass the url of the website you are using with the argument url so Scrapling can separate the properties saved for each element by domain.

If you don't pass a URL, the word default will be used in place of the URL field while saving the element's unique properties. So, this will only be an issue if you later use the same identifier for a different website and again don't pass the URL parameter when initializing it. The save process overwrites previous data, and the adaptive feature uses only the latest saved properties.
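
The collision described above can be modeled with a plain dictionary. The `storage_key` helper is hypothetical and only stands in for the way the URL field and the identifier are combined; it is not part of Scrapling.

```python
def storage_key(identifier, url=None):
    """Hypothetical key builder: 'default' stands in when no URL is given."""
    return (url or "default", identifier)


store = {}

# Two different websites, both parsed without a url, both using the same identifier.
store[storage_key("title")] = {"tag": "h1", "text": "Site A headline"}
store[storage_key("title")] = {"tag": "h2", "text": "Site B headline"}

# The second save replaced the first, so only Site B's properties remain.
print(store[storage_key("title")]["text"])  # Site B headline

# Passing the url keeps the two sites apart.
store[storage_key("title", "site-a.example")] = {"tag": "h1", "text": "Site A headline"}
print(len(store))  # 2
```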

Besides those arguments, we have storage and storage_args. Both are for the class to connect to the database; by default, it uses the SQLite class provided by the library. Those arguments shouldn't matter unless you want to write your own storage system, which we will cover on a separate page in the development section.

Now that you've enabled the adaptive feature globally, you have two main ways to use it.

The CSS/XPath Selection way

As you have seen in the example above, first, you have to use the auto_save argument while selecting an element that exists on the page, like below:

element = page.css('#p1', auto_save=True)

And when the element doesn't exist, you can use the same selector with the adaptive argument, and the library will find it for you:

element = page.css('#p1', adaptive=True)

Pretty simple, eh?

Well, a lot happened under the hood here. Remember the identifier we mentioned before that you need to set to retrieve the element you want? Here, with the css/xpath methods, the identifier is set automatically as the selector you passed here to make things easier :)

Additionally, all these methods accept the identifier argument if you want to set it yourself. This is useful when the selector itself may change over time: a fixed identifier lets you save properties with the auto_save argument under one selector and retrieve them later under another.
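
To make the interplay between auto_save, adaptive, and identifier easier to follow, here is a toy model of the control flow. Everything in it is an assumption for illustration: `select` is not a Scrapling function, the page is a plain dict mapping selectors to elements, and the "relocation" step is a simple text comparison rather than a real similarity search.

```python
saved = {}  # stands in for the database: identifier -> saved properties


def select(page, selector, auto_save=False, adaptive=False, identifier=None):
    """Hypothetical model of a selection method with the adaptive arguments."""
    key = identifier or selector  # the selector doubles as the identifier
    found = page.get(selector, [])
    if found:
        if auto_save:
            saved[key] = found[0]  # remember the first match
        return found
    if adaptive and key in saved:
        # Stand-in for the similarity search: keep elements whose text is unchanged.
        wanted = saved[key]
        return [el for els in page.values() for el in els if el["text"] == wanted["text"]]
    return []


old_page = {"#p1": [{"tag": "article", "text": "Product 1"}]}
new_page = {
    "[data-id=p1]": [{"tag": "article", "text": "Product 1"}],
    "[data-id=p2]": [{"tag": "article", "text": "Product 2"}],
}

select(old_page, "#p1", auto_save=True)        # saved under the key '#p1'
print(select(new_page, "#p1"))                 # []
print(select(new_page, "#p1", adaptive=True))  # [{'tag': 'article', 'text': 'Product 1'}]

# A custom identifier decouples the saved data from the selector text.
select(old_page, "#p1", auto_save=True, identifier="first_product")
print(select(new_page, ".gone", adaptive=True, identifier="first_product"))
# [{'tag': 'article', 'text': 'Product 1'}]
```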

The manual way

You can also save an element manually, then retrieve and relocate it later, all through the adaptive feature, as shown below. This allows you to relocate any element, regardless of the method or selection used to find it!

First, let's say you got an element like this by text:

>>> element = page.find_by_text('Tipping the Velvet', first_match=True)

You can save its unique properties using the save method, as shown below, but you must set the identifier yourself. For this example, I chose my_special_element as an identifier, but it's best to use a meaningful identifier in your code for the same reason you use meaningful variable names :)

>>> page.save(element, 'my_special_element')

Later, when you want to retrieve it and relocate it inside the page with adaptive, it would look like this:

>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, selector_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, selector_type=True).css('::text').getall()
['Tipping the Velvet']

Note that the retrieve and relocate methods are used here.

If you want to keep it as an lxml.etree object, leave out the selector_type argument:

>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]

Troubleshooting

No Matches Found

# 1. Check if data was saved
element_data = page.retrieve('identifier')
if not element_data:
    print("No data saved for this identifier")

# 2. Try with different identifier
products = page.css('.product', adaptive=True, identifier='old_selector')

# 3. Save again with new identifier
products = page.css('.new-product', auto_save=True, identifier='new_identifier')

Wrong Elements Matched

# Use more specific selectors
products = page.css('.product-list .product', auto_save=True)

# Or save with more context
product = page.find_by_text('Product Name').parent
page.save(product, 'specific_product')

Known Issues

In the adaptive save process, only the unique properties of the first element in the selection results are saved. So if the selector you are using selects different elements on the page in other locations, adaptive will return only the first element when you relocate it later. This doesn't include combined CSS selectors (using commas to combine more than one selector, for example), as these selectors are separated and each is executed alone.
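
A small model of this behavior, under stated assumptions: `saved_per_selector` is a hypothetical function, the page is a dict of selector results, and the comma split is deliberately naive (it would break on commas inside constructs such as `:is(...)` or quoted attribute values).

```python
def saved_per_selector(page, selector):
    """Hypothetical model: run each comma-separated part alone, save its first match."""
    results = {}
    for part in (p.strip() for p in selector.split(",")):
        matches = page.get(part, [])
        if matches:
            results[part] = matches[0]  # only the first match is saved
    return results


page = {
    ".product": ["article#p1", "article#p2"],
    ".description": ["p: Description 1", "p: Description 2"],
}

# A single selector matching two elements saves only the first one.
print(saved_per_selector(page, ".product"))
# {'.product': 'article#p1'}

# A combined selector is split, so each part gets its own saved element.
print(saved_per_selector(page, ".product, .description"))
# {'.product': 'article#p1', '.description': 'p: Description 1'}
```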

Final thoughts

Explaining this feature in detail without complications turned out to be challenging. Still, if anything is left unclear, you can head to the discussions section or the Discord server, and I will reply to you ASAP, or you can reach out to me privately and have a chat :)