Karim Shoair committed
Commit · 8d5cc87 · 1 parent: 92b1671
docs: adding more data to the `adaptive` feature page

- docs/parsing/adaptive.md (+5 -3)
docs/parsing/adaptive.md
CHANGED

@@ -50,7 +50,7 @@ When website owners implement structural changes like
 ```
 The selector will no longer function, and your code needs maintenance. That's where Scrapling's `adaptive` feature comes into play.
 
-With Scrapling, you can enable the `adaptive` feature the first time you select an element, and the next time you select that element and it doesn't exist, Scrapling will remember its properties and search on the website for the element with the highest percentage of similarity to that element and without AI :)
+With Scrapling, you can enable the `adaptive` feature the first time you select an element, and the next time you select that element and it doesn't exist, Scrapling will remember its properties and search on the website for the element with the highest percentage of similarity to that element, and without AI :)
 
 ```python
 from scrapling import Selector, Fetcher
@@ -100,6 +100,8 @@ The code will be the same in a real-world scenario, except it will use the same
 
 Hence, in the two examples above, I used both the `Selector` and `Fetcher` classes to show that the adaptive logic is the same.
 
+> Note: the main reason for creating the `adaptive_domain` argument was to handle if the website changed its URL while changing the design/structure. In that case, you can use it to continue using the previously stored adaptive data for the new URL. Otherwise, scrapling will consider it a new website and discard the old data.
+
 ## How the adaptive scraping feature works
 Adaptive scraping works in two phases:
 
@@ -113,7 +115,7 @@ With as few technical details as possible, the general logic goes as follows:
 1. You tell Scrapling to save that element's unique properties in one of the ways we will show below.
 2. Scrapling uses its configured database (SQLite by default) and saves each element's unique properties.
 3. Now, because everything about the element can be changed or removed by the website's owner(s), nothing from the element can be used as a unique identifier for the database. To solve this issue, I made the storage system rely on two things:
-    1. The domain of the current website. If you are using the `Selector` class, pass it when initializing
+    1. The domain of the current website. If you are using the `Selector` class, pass it when initializing; if you are using a fetcher, the domain will be automatically taken from the URL.
     2. An `identifier` to query that element's properties from the database. You don't always have to set the identifier yourself; we'll discuss this later.
 
 Together, they will later be used to retrieve the element's unique properties from the database.
@@ -148,7 +150,7 @@ If you are using the [Selector](main_classes.md#selector) class, you need to pas
 
 If you didn't pass a URL, the word `default` will be used in place of the URL field while saving the element's unique properties. So, this will only be an issue if you use the same identifier later for a different website and don't pass the URL parameter when initializing it. The save process overwrites previous data, and the `adaptive` feature uses only the latest saved properties.
 
-Besides those arguments, we have `storage` and `storage_args`. Both are for the class to connect to the database; by default, it
+Besides those arguments, we have `storage` and `storage_args`. Both are for the class to connect to the database; by default, it uses the SQLite class provided by the library. Those arguments shouldn't matter unless you want to write your own storage system, which we will cover on a [separate page in the development section](../development/adaptive_storage_system.md).
 
 Now that you've enabled the `adaptive` feature globally, you have two main ways to use it.
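The storage-and-relocation logic this diff documents — element properties keyed by (domain, identifier), then a similarity search over candidate elements when the original selector fails — can be sketched in plain Python. This is an illustrative toy only, not Scrapling's implementation: the in-memory dict (Scrapling uses SQLite by default), the function names, and the use of `difflib` for scoring are all my assumptions.

```python
from difflib import SequenceMatcher

# Toy in-memory "database" keyed by (domain, identifier), mirroring the two
# keys the docs describe. Scrapling's default storage is SQLite; a dict is
# used here only to illustrate the keying scheme.
_storage = {}


def save_element(domain, identifier, properties):
    """Save an element's unique properties; saving again overwrites old data,
    matching the overwrite behavior described in the docs."""
    _storage[(domain, identifier)] = properties


def similarity(saved, candidate):
    """Score a candidate element against the saved properties (0.0 to 1.0)
    by averaging per-property string similarity."""
    keys = set(saved) | set(candidate)
    if not keys:
        return 0.0
    scores = [
        SequenceMatcher(None, str(saved.get(k, "")), str(candidate.get(k, ""))).ratio()
        for k in keys
    ]
    return sum(scores) / len(scores)


def relocate_element(domain, identifier, candidates):
    """When the original selector no longer matches, return the candidate
    element most similar to the stored properties, or None if nothing
    was ever saved under this (domain, identifier) pair."""
    saved = _storage.get((domain, identifier))
    if saved is None:
        return None
    return max(candidates, key=lambda c: similarity(saved, c))
```

Even in this toy form, the key property holds: a button whose class attribute was renamed in a redesign still outscores unrelated elements on the page, so it is relocated without any selector pointing at it.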