Karim shoair committed · Commit baf9852 · Parent(s): 75573dc
Corrections and rephrasing

README.md
CHANGED
|
@@ -145,7 +145,7 @@ As you see, Scrapling is on par with Scrapy and slightly faster than Lxml which
|
|
| 145 |
| Scrapling | 2.51 | 1.0x |
|
| 146 |
| AutoScraper | 11.41 | 4.546x |
|
| 147 |
|
| 148 |
-
Scrapling can find elements with more methods and it returns full element `Adaptor` objects not only the text like AutoScraper. So, to make this test fair, both libraries will extract an element with text, find similar elements, and then extract the text content for all of them. As you see, Scrapling is still 4.5 times faster at same task.
|
| 149 |
|
| 150 |
> All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
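The averaging methodology can be approximated with a small `timeit` harness (an illustrative sketch only, not the project's actual benchmark code — see the linked `benchmarks.py` for the real thing):

```python
import timeit

def average_ms(fn, runs=100):
    # Average wall-clock time of `fn` in milliseconds over `runs` executions,
    # mirroring the "average of 100 runs" approach used in the tables above.
    total = timeit.timeit(fn, number=runs)
    return total / runs * 1000

# Example: time a cheap operation and report the per-run average
avg = average_ms(lambda: sum(range(1000)))
```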
|
| 151 |
|
|
@@ -154,7 +154,7 @@ Scrapling is a breeze to get started with - Starting from version 0.2, we requir
|
|
| 154 |
```bash
|
| 155 |
pip3 install scrapling
|
| 156 |
```
|
| 157 |
-
- For using the `StealthyFetcher`, go
|
| 158 |
<details><summary>Windows OS</summary>
|
| 159 |
|
| 160 |
```bash
|
|
@@ -185,21 +185,21 @@ On a fresh installation of Linux, you may also need the following Firefox depend
|
|
| 185 |
|
| 186 |
<small> See the official <a href="https://camoufox.com/python/installation/#download-the-browser">Camoufox documentation</a> for more info on installation</small>
|
| 187 |
|
| 188 |
-
- If you are going to use the `PlayWrightFetcher` options, then install
|
| 189 |
```commandline
|
| 190 |
playwright install chromium
|
| 191 |
```
|
| 192 |
-
- If you are going to use normal requests only with `Fetcher` class then update the fingerprints files with:
|
| 193 |
```commandline
|
| 194 |
python -m browserforge update
|
| 195 |
```
|
| 196 |
|
| 197 |
## Fetching Websites Features
|
| 198 |
-
All fetcher-type classes are imported
|
| 199 |
```python
|
| 200 |
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
|
| 201 |
```
|
| 202 |
-
And all of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug` which are the same ones you give to `Adaptor` class.
|
| 203 |
> [!NOTE]
|
| 204 |
> The `auto_match` argument is enabled by default, and it is the one you should care about the most, as you will see later.
|
| 205 |
### Fetcher
|
|
@@ -213,7 +213,7 @@ For all methods, you have `stealth_headers` which makes `Fetcher` create and use
|
|
| 213 |
>> page = Fetcher().delete('https://httpbin.org/delete')
|
| 214 |
```
|
| 215 |
### StealthyFetcher
|
| 216 |
-
This class is built on top of [Camoufox](https://github.com/daijro/camoufox) which
|
| 217 |
```python
|
| 218 |
>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
|
| 219 |
>> page.status == 200
|
|
@@ -228,14 +228,14 @@ True
|
|
| 228 |
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|
| 229 |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|
| 230 |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
|
| 231 |
-
| extra_headers | A dictionary of extra headers to add
|
| 232 |
| block_webrtc | Blocks WebRTC entirely. | ✔️ |
|
| 233 |
-
| page_action | Added for automation. A function that takes the `page` object,
|
| 234 |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
|
| 235 |
-
| humanize | Humanize the cursor movement. Takes either True
|
| 236 |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
|
| 237 |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|
| 238 |
-
| timeout | The timeout in milliseconds that
|
| 239 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 240 |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 241 |
|
|
@@ -244,18 +244,18 @@ True
|
|
| 244 |
This list isn't final, so expect a lot more additions and flexibility to be added in the next versions!
|
| 245 |
|
| 246 |
### PlayWrightFetcher
|
| 247 |
-
This class is built on top of [Playwright](https://playwright.dev/python/) which currently provides 4 main run options but they can be mixed
|
| 248 |
```python
|
| 249 |
>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
|
| 250 |
>> page.adaptor.css_first("#search a::attr(href)")
|
| 251 |
'https://github.com/D4Vinci/Scrapling'
|
| 252 |
```
|
| 253 |
-
Using this Fetcher class, you can
|
| 254 |
1) Vanilla Playwright without any modifications other than the ones you chose.
|
| 255 |
-
2) Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/).</br> Some of the things this fetcher's stealth mode
|
| 256 |
* Patching the CDP runtime fingerprint.
|
| 257 |
-
* Mimics some of real browsers' properties by
|
| 258 |
-
* Using custom flags on launch to hide
|
| 259 |
* Generating real browser headers of the same type and the same user OS, then appending them to the request's headers.
|
| 260 |
3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
|
| 261 |
4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
|
|
@@ -271,12 +271,12 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
|
|
| 271 |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|
| 272 |
| useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | ✔️ |
|
| 273 |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|
| 274 |
-
| timeout | The timeout in milliseconds that
|
| 275 |
-
| page_action | Added for automation. A function that takes the `page` object,
|
| 276 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 277 |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 278 |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
|
| 279 |
-
| extra_headers | A dictionary of extra headers to add
|
| 280 |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
|
| 281 |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
|
| 282 |
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
|
|
@@ -343,7 +343,7 @@ You can select elements by their text content in multiple ways, here's a full ex
|
|
| 343 |
```python
|
| 344 |
>>> page = Fetcher().get('https://books.toscrape.com/index.html').adaptor
|
| 345 |
|
| 346 |
-
>>> page.find_by_text('Tipping the Velvet') # Find the first element
|
| 347 |
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
|
| 348 |
|
| 349 |
>>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
|
|
@@ -410,7 +410,7 @@ Let's say you are scraping a page with a structure like this:
|
|
| 410 |
</section>
|
| 411 |
</div>
|
| 412 |
```
|
| 413 |
-
|
| 414 |
```python
|
| 415 |
page.css('#p1')
|
| 416 |
```
|
|
@@ -435,7 +435,7 @@ When website owners implement structural changes like
|
|
| 435 |
</div>
|
| 436 |
</div>
|
| 437 |
```
|
| 438 |
-
The selector will no longer function and your code needs maintenance. That's where Scrapling auto-matching feature comes into play.
|
| 439 |
|
| 440 |
```python
|
| 441 |
from scrapling import Adaptor
|
|
@@ -448,12 +448,12 @@ if not element: # One day website changes?
|
|
| 448 |
```
|
| 449 |
> How does the auto-matching work? Check the [FAQs](#-enlightening-questions-and-faqs) section for that and other possible issues while auto-matching.
|
| 450 |
|
| 451 |
-
#### Real
|
| 452 |
-
Let's use a real website as an example and use one of the fetchers to fetch its source. To do this we need to find a website that will change its design/structure soon
|
| 453 |
|
| 454 |
To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverFlow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/), pretty old huh?</br>Let's test if the automatch feature can extract the same button in the old design from 2010 and the current design using the same selector :)
|
| 455 |
|
| 456 |
-
If I want to extract the Questions button from the old design I can use a selector like this `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`
|
| 457 |
Now let's test the same selector in both versions
|
| 458 |
```python
|
| 459 |
>> from scrapling import Fetcher
|
|
@@ -472,9 +472,9 @@ Now let's test the same selector in both versions
|
|
| 472 |
... print('Scrapling found the same element in the old design and the new design!')
|
| 473 |
'Scrapling found the same element in the old design and the new design!'
|
| 474 |
```
|
| 475 |
-
Note that I used a new argument called `automatch_domain`, this because for Scrapling these are two different URLs not the website so it isolates their data. To tell Scrapling they are the same website, we
|
| 476 |
|
| 477 |
-
In
|
| 478 |
|
| 479 |
**Notes:**
|
| 480 |
1. For the two examples above, I used the `Adaptor` class one time and the `Fetcher` class the second time, just to show that you can create the `Adaptor` object yourself if you have the source, or fetch the source using any `Fetcher` class, which will then create the `Adaptor` object for you on the `.adaptor` property.
|
|
@@ -494,25 +494,25 @@ In real world scenario, the code will be the same expect it will use the same UR
|
|
| 494 |
```
|
| 495 |
|
| 496 |
### Find elements by filters
|
| 497 |
-
Inspired by BeautifulSoup's `find_all` function you can find elements by using `find_all`/`find` methods. Both methods can take multiple types of filters and
|
| 498 |
|
| 499 |
* To be more specific:
|
| 500 |
* Any string passed is considered a tag name
|
| 501 |
* Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
|
| 502 |
-
* Any dictionary is considered a mapping of HTML element(s)
|
| 503 |
* Any regex patterns passed are used as filters
|
| 504 |
* Any functions passed are used as filters
|
| 505 |
-
* Any keyword argument passed is considered as
|
| 506 |
|
| 507 |
So the way it works is after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
|
| 508 |
<br/>It filters all elements in the current page/element in the following order:
|
| 509 |
|
| 510 |
1. All elements with the passed tag name(s).
|
| 511 |
-
2. All elements that
|
| 512 |
-
3. All elements that
|
| 513 |
-
4. All elements that
|
| 514 |
|
| 515 |
-
Note: The filtering process always starts from the first filter it finds in the filtering order above so if no tag name(s) are passed but attributes passed, the process starts from that layer and so
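The waterfall order described above can be sketched in plain Python (an illustrative dispatch-by-type sketch, not Scrapling's actual implementation; the plain dicts below stand in for `Adaptor` objects):

```python
import re

def waterfall_filter(elements, *filters):
    # Each filter narrows the previous filter's results, waterfall-style:
    # tag names first, then attribute mappings, then regex/function filters.
    results = elements
    for f in filters:
        if isinstance(f, str):            # a string is a tag name
            results = [e for e in results if e["tag"] == f]
        elif isinstance(f, dict):         # a dict maps attribute names to values
            results = [e for e in results
                       if all(e["attrs"].get(k) == v for k, v in f.items())]
        elif isinstance(f, re.Pattern):   # a regex filters by text content
            results = [e for e in results if f.search(e["text"])]
        elif callable(f):                 # a function is an arbitrary predicate
            results = [e for e in results if f(e)]
    return results
```

So a call like `waterfall_filter(elements, 'div', {'class': 'quote'})` roughly mirrors what `page.find_all('div', class_='quote')` does conceptually.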
|
| 516 |
|
| 517 |
Examples to clear any confusion :)
|
| 518 |
|
|
@@ -525,7 +525,7 @@ Examples to clear any confusion :)
|
|
| 525 |
<data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
|
| 526 |
...]
|
| 527 |
|
| 528 |
-
# Find all div elements with class that equals `quote`.
|
| 529 |
>> page.find_all('div', class_='quote')
|
| 530 |
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
|
| 531 |
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
|
|
@@ -537,29 +537,29 @@ Examples to clear any confusion :)
|
|
| 537 |
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
|
| 538 |
...]
|
| 539 |
|
| 540 |
-
# Find all elements with class that equals `quote`.
|
| 541 |
>> page.find_all({'class': 'quote'})
|
| 542 |
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
|
| 543 |
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
|
| 544 |
...]
|
| 545 |
|
| 546 |
-
# Find all div elements with class that equals `quote`, and contains the element `.text` which contains the word 'world' in its content.
|
| 547 |
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
|
| 548 |
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
|
| 549 |
|
| 550 |
-
# Find all elements that
|
| 551 |
>> page.find_all(lambda element: len(element.children) > 0)
|
| 552 |
[<data='<html lang="en"><head><meta charset="UTF...'>,
|
| 553 |
<data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
|
| 554 |
<data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
|
| 555 |
...]
|
| 556 |
|
| 557 |
-
# Find all elements that
|
| 558 |
>> page.find_all(lambda element: "world" in element.text)
|
| 559 |
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
|
| 560 |
<data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]
|
| 561 |
|
| 562 |
-
# Find all span elements that
|
| 563 |
>> page.find_all('span', re.compile(r'world'))
|
| 564 |
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]
|
| 565 |
|
|
@@ -586,12 +586,12 @@ Here's what else you can do with Scrapling:
|
|
| 586 |
```
|
| 587 |
- Saving and retrieving elements manually to auto-match them outside the `css` and the `xpath` methods, but you have to set the identifier yourself.
|
| 588 |
|
| 589 |
-
- To save element to the database:
|
| 590 |
```python
|
| 591 |
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
|
| 592 |
>>> page.save(element, 'my_special_element')
|
| 593 |
```
|
| 594 |
-
- Now later when you want to retrieve it and relocate it
|
| 595 |
```python
|
| 596 |
>>> element_dict = page.retrieve('my_special_element')
|
| 597 |
>>> page.relocate(element_dict, adaptor_type=True)
|
|
@@ -615,7 +615,7 @@ expensive_products = page.css('.product_pod').filter(
|
|
| 615 |
|
| 616 |
- Searching results for the first one that matches a function
|
| 617 |
```python
|
| 618 |
-
# Find all the
|
| 619 |
page.css('.product_pod').search(
|
| 620 |
lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
|
| 621 |
)
|
|
@@ -636,7 +636,7 @@ page.css('.product_pod').search(
|
|
| 636 |
case_sensitive=False,  # Set the regex to ignore letter case while compiling it
|
| 637 |
)
|
| 638 |
```
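The chainable result methods above can be pictured as a list subclass (a simplified sketch of the idea only, not Scrapling's actual result class):

```python
class ResultsSketch(list):
    # Simplified stand-in for the idea behind Scrapling's chainable
    # result lists: a plain `list` subclass with filter/search helpers.
    def filter(self, func):
        # Keep only the items the predicate accepts; still chainable.
        return ResultsSketch(item for item in self if func(item))

    def search(self, func):
        # Return the first item the predicate accepts, or None.
        return next((item for item in self if func(item)), None)
```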
|
| 639 |
-
Hence all of these methods are
|
| 640 |
|
| 641 |
|
| 642 |
- Doing operations on the text content itself includes
|
|
@@ -648,13 +648,13 @@ page.css('.product_pod').search(
|
|
| 648 |
```python
|
| 649 |
page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
|
| 650 |
```
|
| 651 |
-
- Sort all characters in the string as if it
|
| 652 |
```python
|
| 653 |
quote.sort(reverse=False)
|
| 654 |
```
|
| 655 |
> To be clear, `TextHandler` is a sub-class of Python's `str` so all normal operations/methods that work with Python strings will work with it.
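Since it subclasses `str`, you can picture `TextHandler` roughly like this (a simplified sketch; the real class has more helpers and different internals):

```python
import json
import re

class TextHandlerSketch(str):
    # Simplified stand-in for Scrapling's TextHandler: a plain `str`
    # subclass, so every normal string method still works on it.
    def re_first(self, pattern):
        # Return the first regex match (first group if the pattern has one).
        m = re.search(pattern, self)
        if not m:
            return None
        return TextHandlerSketch(m.group(1) if m.groups() else m.group(0))

    def json(self):
        # Parse the string content as JSON.
        return json.loads(self)
```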
|
| 656 |
|
| 657 |
-
- Any element's attributes are not exactly a dictionary but a sub-class of [mapping](https://docs.python.org/3/glossary.html#term-mapping) called `AttributesHandler` that's read-only so it's faster and string values returned are actually `TextHandler` objects so all operations above can be done on them, standard dictionary operations that
|
| 658 |
- Unlike standard dictionaries, here you can search by values too and can do partial searches. It might be handy in some cases (returns a generator of matches)
|
| 659 |
```python
|
| 660 |
>>> for item in element.attrib.search_values('catalogue', partial=True):
|
|
@@ -681,7 +681,7 @@ There are a lot of deep details skipped here to make this as short as possible s
|
|
| 681 |
|
| 682 |
Note that implementing your own storage system can be complex, as there are some strict rules, such as inheriting from the same abstract class, following the singleton design pattern used in other classes, and more. So make sure to read the docs first.
|
| 683 |
|
| 684 |
-
To give
|
| 685 |
|
| 686 |
## ⚡ Enlightening Questions and FAQs
|
| 687 |
This section addresses common questions about Scrapling; please read it before opening an issue.
|
|
@@ -696,7 +696,7 @@ This section addresses common questions about Scrapling, please read this sectio
|
|
| 696 |
Together both are used to retrieve the element's unique properties from the database later.
|
| 697 |
4. Later, when you enable the `auto_match` parameter for both the Adaptor instance and the method call, the element's properties are retrieved, and Scrapling loops over all elements in the page, comparing each one's unique properties to the unique properties we already have for this element; a score is calculated for each one.
|
| 698 |
5. The comparison between elements is not exact but more about finding how similar these values are, so everything is taken into consideration even the values' order like the order in which the element class names were written before and the order in which the same element class names are written now.
|
| 699 |
-
6. The score for each element is stored in the table and in the end, the element(s) with the highest combined similarity scores are returned.
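The scoring step can be illustrated with `difflib` (a rough sketch of the concept only; Scrapling's real comparison logic is more involved than a plain ratio):

```python
from difflib import SequenceMatcher

def similarity_score(saved: dict, candidate: dict) -> float:
    # Compare each saved unique property against the candidate element's
    # property. SequenceMatcher rewards similar values AND similar order,
    # e.g. class names written in the same order score higher.
    scores = []
    for key, saved_value in saved.items():
        candidate_value = str(candidate.get(key, ""))
        scores.append(SequenceMatcher(None, str(saved_value), candidate_value).ratio())
    return sum(scores) / len(scores)

def best_match(saved: dict, elements: list) -> dict:
    # Return the element with the highest combined similarity score.
    return max(elements, key=lambda e: similarity_score(saved, e))
```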
|
| 700 |
|
| 701 |
### How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?
|
| 702 |
Not a big problem, as it depends on your usage. The word `default` will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website that was also initialized without the URL parameter: the save process will overwrite the previous data, and auto-matching uses only the latest saved properties.
|
|
@@ -725,7 +725,7 @@ Pretty much yeah, almost all features you get from BeautifulSoup can be found or
|
|
| 725 |
Of course, you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later as the model feature in AutoScraper. I have pulled all top articles about AutoScraper from Google and tested Scrapling against examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.
|
| 726 |
|
| 727 |
### Is Scrapling thread-safe?
|
| 728 |
-
Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its
|
| 729 |
|
| 730 |
## Sponsors
|
| 731 |
[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
|
|
@@ -737,7 +737,7 @@ Please read the [contributing file](https://github.com/D4Vinci/Scrapling/blob/ma
|
|
| 737 |
|
| 738 |
## Disclaimer for Scrapling Project
|
| 739 |
> [!CAUTION]
|
| 740 |
-
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or within their allowed rules like `robots.txt` file, for example.
|
| 741 |
|
| 742 |
## License
|
| 743 |
This work is licensed under the BSD-3-Clause License
|
|
|
|
| 145 |
| Scrapling | 2.51 | 1.0x |
|
| 146 |
| AutoScraper | 11.41 | 4.546x |
|
| 147 |
|
| 148 |
+
Scrapling can find elements with more methods and it returns full element `Adaptor` objects not only the text like AutoScraper. So, to make this test fair, both libraries will extract an element with text, find similar elements, and then extract the text content for all of them. As you see, Scrapling is still 4.5 times faster at the same task.
|
| 149 |
|
| 150 |
> All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
|
| 151 |
|
|
|
|
| 154 |
```bash
|
| 155 |
pip3 install scrapling
|
| 156 |
```
|
| 157 |
+
- To use the `StealthyFetcher`, go to the command line and download the browser with
|
| 158 |
<details><summary>Windows OS</summary>
|
| 159 |
|
| 160 |
```bash
|
|
|
|
| 185 |
|
| 186 |
<small> See the official <a href="https://camoufox.com/python/installation/#download-the-browser">Camoufox documentation</a> for more info on installation</small>
|
| 187 |
|
| 188 |
+
- If you are going to use the `PlayWrightFetcher` options, then install Playwright's Chromium browser with:
|
| 189 |
```commandline
|
| 190 |
playwright install chromium
|
| 191 |
```
|
| 192 |
+
- If you are going to use normal requests only with the `Fetcher` class, then update the fingerprint files with:
|
| 193 |
```commandline
|
| 194 |
python -m browserforge update
|
| 195 |
```
|
| 196 |
|
| 197 |
## Fetching Websites Features
|
| 198 |
+
All fetcher-type classes are imported in the same way
|
| 199 |
```python
|
| 200 |
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
|
| 201 |
```
|
| 202 |
+
And all of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug` which are the same ones you give to the `Adaptor` class.
|
| 203 |
> [!NOTE]
|
| 204 |
> The `auto_match` argument is enabled by default, and it is the one you should care about the most, as you will see later.
|
| 205 |
### Fetcher
|
|
|
|
| 213 |
>> page = Fetcher().delete('https://httpbin.org/delete')
|
| 214 |
```
|
| 215 |
### StealthyFetcher
|
| 216 |
+
This class is built on top of [Camoufox](https://github.com/daijro/camoufox) which by default bypasses most of the anti-bot protections. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
|
| 217 |
```python
|
| 218 |
>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
|
| 219 |
>> page.status == 200
|
|
|
|
| 228 |
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|
| 229 |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|
| 230 |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
|
| 231 |
+
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
|
| 232 |
| block_webrtc | Blocks WebRTC entirely. | ✔️ |
|
| 233 |
+
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
|
| 234 |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
|
| 235 |
+
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
|
| 236 |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
|
| 237 |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|
| 238 |
+
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
|
| 239 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 240 |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
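For example, a `page_action` callable simply receives the page object, performs your automation, and returns the page again (the selector below is a hypothetical example for this sketch):

```python
def accept_cookies(page):
    # `page` is the browser page object the fetcher hands in; do any
    # automation you need, then return the same page object.
    page.click("#accept-cookies")  # hypothetical selector
    return page

# Passed as: StealthyFetcher().fetch(url, page_action=accept_cookies)
```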
|
| 241 |
|
|
|
|
| 244 |
This list isn't final, so expect a lot more additions and flexibility to be added in the next versions!
|
| 245 |
|
| 246 |
### PlayWrightFetcher
|
| 247 |
+
This class is built on top of [Playwright](https://playwright.dev/python/), which currently provides 4 main run options, but they can be mixed as you want.
|
| 248 |
```python
|
| 249 |
>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
|
| 250 |
>> page.adaptor.css_first("#search a::attr(href)")
|
| 251 |
'https://github.com/D4Vinci/Scrapling'
|
| 252 |
```
|
| 253 |
+
Using this Fetcher class, you can make requests with:
|
| 254 |
1) Vanilla Playwright without any modifications other than the ones you chose.
|
| 255 |
+
2) Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/).</br> Some of the things this fetcher's stealth mode does include:
|
| 256 |
* Patching the CDP runtime fingerprint.
|
| 257 |
+
* Mimicking some of real browsers' properties by injecting several JS files and using custom options.
|
| 258 |
+
* Using custom flags on launch to hide Playwright even more and make it faster.
|
| 259 |
* Generating real browser headers of the same type and the same user OS, then appending them to the request's headers.
|
| 260 |
3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
|
| 261 |
4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
|
|
|
|
| 271 |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|
| 272 |
| useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | ✔️ |
|
| 273 |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|
| 274 |
+
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
|
| 275 |
+
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
|
| 276 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 277 |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 278 |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
|
| 279 |
+
| extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
|
| 280 |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
|
| 281 |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
|
| 282 |
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
|
|
|
|
| 343 |
```python
|
| 344 |
>>> page = Fetcher().get('https://books.toscrape.com/index.html').adaptor
|
| 345 |
|
| 346 |
+
>>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
|
| 347 |
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
|
| 348 |
|
| 349 |
>>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
|
|
|
|
| 410 |
</section>
|
| 411 |
</div>
|
| 412 |
```
|
| 413 |
+
And you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this
|
| 414 |
```python
|
| 415 |
page.css('#p1')
|
| 416 |
```
|
|
|
|
| 435 |
</div>
|
| 436 |
</div>
|
| 437 |
```
|
| 438 |
+
The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
|
| 439 |
|
| 440 |
```python
|
| 441 |
from scrapling import Adaptor
|
|
|
|
| 448 |
```
|
| 449 |
> How does the auto-matching work? Check the [FAQs](#-enlightening-questions-and-faqs) section for that and other possible issues while auto-matching.
|
| 450 |
|
| 451 |
+
#### Real-World Scenario
|
| 452 |
+
Let's use a real website as an example and use one of the fetchers to fetch its source. To do this we need to find a website that will change its design/structure soon, take a copy of its source then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner but that will make it a staged test haha.
|
| 453 |
|
| 454 |
To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverFlow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/), pretty old huh?</br>Let's test if the automatch feature can extract the same button in the old design from 2010 and the current design using the same selector :)
|
| 455 |
|
| 456 |
+
If I want to extract the Questions button from the old design, I can use a selector like `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is too specific because it was generated by Google Chrome.
Now let's test the same selector in both versions
```python
>> from scrapling import Fetcher
... print('Scrapling found the same element in the old design and the new design!')
'Scrapling found the same element in the old design and the new design!'
```
Note that I used a new argument called `automatch_domain`. This is because, to Scrapling, these are two different URLs and hence two different websites, so it isolates their auto-matching data. To tell Scrapling they are the same website, we pass the domain to use when saving the auto-match data for both of them, so Scrapling doesn't isolate them.
In a real-world scenario, the code would be the same except that both requests would use the same URL, so you wouldn't need the `automatch_domain` argument. This is the closest example I can give to a real-world case, so I hope it didn't confuse you :)
**Notes:**
1. In the two examples above, I used the `Adaptor` class once and the `Fetcher` class the other time, to show that you can either create the `Adaptor` object yourself if you already have the page source, or fetch the source with any `Fetcher` class, which then creates the `Adaptor` object for you on its `.adaptor` property.
```
### Find elements by filters
Inspired by BeautifulSoup's `find_all` function, you can find elements with the `find_all`/`find` methods. Both methods can take multiple types of filters and return all elements on the page that satisfy all of them.

* To be more specific:
  * Any string passed is considered a tag name.
  * Any iterable passed (List, Tuple, Set, etc.) is considered an iterable of tag names.
  * Any dictionary passed is considered a mapping of HTML element attribute names to attribute values.
  * Any regex patterns passed are used as filters.
  * Any functions passed are used as filters.
  * Any keyword arguments passed are considered HTML element attributes with their values.
After collecting all passed arguments and keywords, each filter passes its results to the next filter in a waterfall-like filtering system.<br/>It filters all elements in the current page/element in the following order:
1. All elements with the passed tag name(s).
2. All elements that match all passed attribute(s).
3. All elements that match all passed regex patterns.
4. All elements that fulfill all passed function(s).
Note: The filtering process always starts from the first filter present in the order above, so if no tag name(s) are passed but attributes are, the process starts from the attributes layer, and so on. **But the order in which you pass the arguments doesn't matter.**
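
The waterfall order above can be sketched conceptually like this (this is NOT Scrapling's actual implementation; the element representation and function names are illustrative only):

```python
import re

# Conceptual sketch of the waterfall filtering order described above.
# This is NOT Scrapling's implementation; elements are modeled as plain
# dicts with 'tag', 'attributes', and 'text' keys for illustration only.
def waterfall_filter(elements, tags=None, attributes=None, patterns=None, functions=None):
    results = elements
    if tags:  # 1. keep elements whose tag is one of the passed tag name(s)
        results = [e for e in results if e["tag"] in tags]
    if attributes:  # 2. keep elements that match all passed attributes
        results = [e for e in results
                   if all(e["attributes"].get(k) == v for k, v in attributes.items())]
    if patterns:  # 3. keep elements whose text matches all passed regex patterns
        results = [e for e in results
                   if all(re.search(p, e["text"]) for p in patterns)]
    if functions:  # 4. keep elements that fulfill all passed functions
        results = [e for e in results if all(f(e) for f in functions)]
    return results

elements = [
    {"tag": "div", "attributes": {"class": "quote"}, "text": "Hello world"},
    {"tag": "span", "attributes": {"class": "quote"}, "text": "Hello"},
]
# Only the first element survives both the tag and the regex layers
assert waterfall_filter(elements, tags={"div"}, patterns=[r"world"]) == [elements[0]]
```

Each stage only narrows the previous stage's results, which is why the order you pass the arguments in doesn't change the outcome.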
Examples to clear any confusion :)
<data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
...]
# Find all div elements with a class that equals `quote`.
>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]
# Find all elements with a class that equals `quote`.
>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]
# Find all div elements with class `quote` whose `.text` element contains the word 'world'.
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
# Find all elements that have child elements.
>> page.find_all(lambda element: len(element.children) > 0)
[<data='<html lang="en"><head><meta charset="UTF...'>,
<data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
<data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
...]
# Find all elements that contain the word 'world' in their content.
>> page.find_all(lambda element: "world" in element.text)
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]
# Find all span elements that match the given regex
>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]
```
- You can save and retrieve elements manually to auto-match them outside the `css` and `xpath` methods, but you have to set the identifier yourself.
- To save an element to the database:
```python
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
>>> page.save(element, 'my_special_element')
```
- Later, when you want to retrieve it and relocate it inside the page with auto-matching:
```python
>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, adaptor_type=True)
- Searching the results for the first one that matches a function:
```python
# Get the first product whose price is '54.23'
page.css('.product_pod').search(
lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
)
case_sensitive=False,  # Compile the regex to ignore the letters' case
)
```
All of these methods come from the `TextHandler` class that wraps the text content, so the same can be done directly on the `.text` property or the output of the equivalent selector methods.
- Doing operations on the text content itself, such as:
```python
page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
```
- Sort all characters in the string as if it were a list and return the new string
```python
quote.sort(reverse=False)
```
> To be clear, `TextHandler` is a sub-class of Python's `str` so all normal operations/methods that work with Python strings will work with it.
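
To illustrate what being a `str` subclass means in practice, here is a minimal toy model (not Scrapling's actual `TextHandler` code; the class and method here are illustrative only):

```python
import re

# A toy str subclass illustrating the idea behind TextHandler
# (NOT Scrapling's actual class): it behaves like a normal string
# while carrying extra convenience methods.
class ToyTextHandler(str):
    def re_first(self, pattern):
        # Return the first regex match (group 1 if the pattern has groups) or None
        match = re.search(pattern, self)
        if match:
            return match.group(1) if match.groups() else match.group(0)
        return None

text = ToyTextHandler("Price: 54.23 GBP")
assert isinstance(text, str)                 # it is still a real string
assert text.upper().startswith("PRICE")      # all normal str methods work
assert text.re_first(r"[\d\.]+") == "54.23"  # plus the extra helpers
```

Because the subclass adds behavior without overriding `str` internals, anything that accepts a plain string accepts it too.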
- An element's attributes are not exactly a dictionary but a read-only sub-class of [mapping](https://docs.python.org/3/glossary.html#term-mapping) called `AttributesHandler`, which makes it faster. The string values it returns are actually `TextHandler` objects, so all the operations above can be done on them, plus the standard dictionary operations that don't modify the data, and more :)
- Unlike standard dictionaries, you can also search by values and do partial searches, which can be handy in some cases (returns a generator of matches):
```python
>>> for item in element.attrib.search_values('catalogue', partial=True):
Note that implementing your own storage system can be complex, as there are some strict rules such as inheriting from the same abstract class and following the singleton design pattern used in the other classes. So make sure to read the docs first.
Detailed documentation of the library will need a website. I'm trying to rush building that website while researching new ideas and adding more features/tests/benchmarks, but time is tight with too many spinning plates between work, personal life, and working on Scrapling. You can help by using the [sponsor button](https://github.com/sponsors/D4Vinci) above :)
## ⚡ Enlightening Questions and FAQs
This section addresses common questions about Scrapling; please read it before opening an issue.
Together both are used to retrieve the element's unique properties from the database later.
4. Later, when you enable the `auto_match` parameter on both the Adaptor instance and the method call, the element's properties are retrieved, and Scrapling loops over all elements on the page, comparing each one's unique properties to the saved ones and calculating a similarity score for each.
5. The comparison between elements is not exact; it measures how similar the values are, so everything is taken into consideration, even the order of the values, such as the order in which the element's class names were written before versus how they are written now.
6. The score for each element is stored in the table, and in the end, the element(s) with the highest combined similarity scores are returned.
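
The scoring idea in steps 4 through 6 can be sketched roughly like this (a simplified illustration; the property names and scoring formula are NOT Scrapling's actual algorithm):

```python
from difflib import SequenceMatcher

# Simplified illustration of similarity-based matching (NOT Scrapling's real
# scoring code): each candidate's properties are fuzzily compared to the saved
# ones and the candidate with the highest combined score wins.
def similarity(a, b):
    # Order-sensitive fuzzy ratio, so even the order of class names matters
    return SequenceMatcher(None, a, b).ratio()

def best_match(saved, candidates):
    def score(candidate):
        return sum(similarity(str(saved.get(key, "")), str(candidate.get(key, "")))
                   for key in ("tag", "class", "text"))
    return max(candidates, key=score)

saved = {"tag": "a", "class": "question-link", "text": "Questions"}
candidates = [
    {"tag": "a", "class": "nav-link questions", "text": "Questions"},
    {"tag": "a", "class": "footer-link", "text": "About"},
]
# The first candidate wins: identical tag and text outweigh the changed class
assert best_match(saved, candidates) == candidates[0]
```

The key point is that a candidate doesn't need to match exactly; it just needs to be the most similar overall.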
### How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?
Not a big problem; it depends on your usage. The word `default` is used in place of the URL field when saving the element's unique properties, so this only becomes an issue if you later use the same identifier for a different website that was also initialized without a URL. In that case, the save process overwrites the previous data, and auto-matching uses only the latest saved properties.
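
A tiny sketch of why reusing an identifier without URLs overwrites data (illustrative only; the storage layout and names below are NOT Scrapling's actual code):

```python
# Conceptual sketch of how identifier storage can collide when no URL is given.
# This is NOT Scrapling's storage implementation; names here are illustrative.
storage = {}

def save_properties(identifier, properties, url=None):
    # Without a URL, a shared "default" namespace is used
    key = (url or "default", identifier)
    storage[key] = properties  # a second save with the same key overwrites the first

save_properties("login-button", {"tag": "a", "class": "btn"})          # site A, no URL
save_properties("login-button", {"tag": "button", "class": "submit"})  # site B, no URL

# Both saves landed under ("default", "login-button"); only the latest survives
assert storage[("default", "login-button")] == {"tag": "button", "class": "submit"}
```

Passing a URL for at least one of the sites would give each save its own key and avoid the overwrite.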
Of course, you can find elements by text/regex, find similar elements more reliably than AutoScraper, and save/retrieve elements manually to reuse later, like AutoScraper's model feature. I pulled the top articles about AutoScraper from Google and tested Scrapling against their examples. In all of them, Scrapling got the same results as AutoScraper in much less time.
### Is Scrapling thread-safe?
Yes, Scrapling instances are thread-safe; each Adaptor instance maintains its own state.
## Sponsors
[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
## Disclaimer for Scrapling Project
> [!CAUTION]
> This library is provided for educational and research purposes only. By using it, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. It should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website's owner or you stay within its allowed rules, such as those in its `robots.txt` file.
## License
This work is licensed under the BSD-3-Clause License.