Karim Shoair committed on
Commit · 2ab4b45
Parent(s): e568fb5

First version of the `0.2` README

> I probably forgot some features and made some mistakes here and there, so I will review it again later.

README.md CHANGED
@@ -1,30 +1,77 @@
# 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)
Dealing with failing web scrapers due to website changes? Meet Scrapling.
Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. Whether you're a beginner or an expert, Scrapling provides powerful features while maintaining simplicity.
```python
from scrapling import Adaptor

products = page.css('.product', auto_match=True)  # Still finds them!
```
## Key Features
### Adaptive Scraping
- 🔄 **Smart Element Tracking**: Locate previously identified elements after website structure changes, using an intelligent similarity system and integrated storage.
- 🎯 **Flexible Querying**: Use CSS selectors, XPath, text search, or regex - chain them however you want!
- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you want on the page (Ex: other products like the product you found on the page).
- 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors, using Scrapling's powerful features.

### Performance
- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries (outperforming BeautifulSoup in parsing by up to 620x in our tests).
- 🔋 **Memory Efficient**: Optimized data structures for minimal memory footprint.
- ⚡ **Fast JSON serialization**: 10x faster JSON serialization than the standard json library with more options.
@@ -32,23 +79,18 @@ products = page.css('.product', auto_match=True) # Still finds them!
- 🛠️ **Powerful Navigation API**: Traverse the DOM tree easily in all directions and get the info you want (parent, ancestors, sibling, children, next/previous element, and more).
- 🧬 **Rich Text Processing**: All strings have built-in methods for regex matching, cleaning, and more. All elements' attributes are read-only dictionaries that are faster than standard dictionaries with added methods.
- 📝 **Automatic Selector Generation**: Create robust CSS/XPath selectors for any element.
- 🔌 **API Similar to Scrapy/BeautifulSoup**: Familiar methods and similar pseudo-elements for Scrapy and BeautifulSoup users.
- 📘 **Type hints**: Complete type coverage for better IDE support and fewer bugs.
## Getting Started

Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

```python
import requests
from scrapling import Adaptor

url = 'https://quotes.toscrape.com/'
response = requests.get(url)

# Create an Adaptor instance
page = Adaptor(response.text, url=url)
# Get all strings in the full page
page.get_all_text(ignore_tags=('script', 'style'))

@@ -56,10 +98,17 @@ page.get_all_text(ignore_tags=('script', 'style'))
quotes = page.css('.quote .text::text')  # CSS selector
quotes = page.xpath('//span[@class="text"]/text()')  # XPath
quotes = page.css('.quote').css('.text::text')  # Chained selectors
quotes = [element.text for element in page.css('.quote .text')]  # Slower than the bulk query above

# Get the first quote element
quote = page.css_first('.quote')  # same as page.css('.quote').first or page.css('.quote')[0]

# Working with elements
quote.html_content  # Inner HTML
@@ -67,10 +116,9 @@ quote.prettify() # Prettified version of Inner HTML
quote.attrib  # Element attributes
quote.path  # DOM path to element (List)
```

To keep it simple, all methods can be chained on top of each other!

## Performance

Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds, all while focusing exclusively on parsing HTML documents.
Here are benchmarks comparing Scrapling to popular Python libraries in two tests.
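As a rough illustration of how numbers like these are usually produced (this is a generic sketch with toy stand-in functions, not the project's actual benchmark code), a minimal harness built on the stdlib's `timeit` looks like:

```python
import timeit

def benchmark(fn, repeat=5, number=100):
    """Return the best average time (in seconds) per call of `fn`."""
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number

# Toy stand-ins for "parse with library A" vs "library B"; swap in real parsers.
doc = "<html>" + "<p>hello</p>" * 1000 + "</html>"
fast = lambda: doc.count("<p>")
slow = lambda: len([i for i in range(len(doc)) if doc.startswith("<p>", i)])

print(f"fast: {benchmark(fast):.2e}s  slow: {benchmark(slow):.2e}s")
```

Taking the minimum over several repeats is the usual way to filter out interference from other processes.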
@@ -146,7 +194,101 @@ playwright install chromium

python -m browserforge update
```

### Smart Navigation
```python
>>> quote.tag
@@ -166,14 +308,13 @@ python -m browserforge update

>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

>>> quote.next  # gets the next element; the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>

>>> quote.children

>>> quote.has_class('quote')
True
@@ -182,7 +323,7 @@ True
>>> quote.css_selector
'body > div > div:nth-of-type(2) > div > div'

# Test these selectors on your favorite browser or reuse them again in the library
>>> quote.xpath_selector
'//body/div/div[2]/div/div'
```
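Generated selectors like the ones above follow a simple scheme: walk up from the element to the root, emitting the tag name plus an `:nth-of-type()` index whenever the element is not the first of its tag among its siblings. A rough, hypothetical sketch of that idea using only the stdlib parser (not Scrapling's actual implementation):

```python
from html.parser import HTMLParser

class PathBuilder(HTMLParser):
    """Builds a CSS-like path for every element while parsing."""
    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, index among same-tag siblings)
        self.counters = [{}]   # per-depth tag occurrence counters
        self.paths = []

    def handle_starttag(self, tag, attrs):
        counts = self.counters[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append((tag, counts[tag]))
        self.counters.append({})
        self.paths.append(" > ".join(
            t if i == 1 else f"{t}:nth-of-type({i})" for t, i in self.stack
        ))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.counters.pop()

builder = PathBuilder()
builder.feed("<body><div></div><div><span>hi</span></div></body>")
print(builder.paths)  # ['body', 'body > div', 'body > div:nth-of-type(2)', ...]
```

Note this toy version only indexes repeated tags (matching the `div:nth-of-type(2)` style shown above) and ignores void elements.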
@@ -200,9 +341,7 @@ You can search for a specific ancestor of an element that satisfies a function,

### Content-based Selection & Finding Similar Elements
You can select elements by their text content in multiple ways; here's a full example on another website:
```python
>>> response = requests.get('https://books.toscrape.com/index.html')

>>> page = Adaptor(response.text, url=response.url)

>>> page.find_by_text('Tipping the Velvet')  # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
@@ -256,8 +395,6 @@ To increase the complexity a little bit, let's say we want to get all books' dat
The [documentation](https://github.com/D4Vinci/Scrapling/tree/main/docs/Examples) will provide more advanced examples.

### Handling Structural Changes
> Because [the internet archive](https://web.archive.org/) is down at the time of writing this, I can't use real websites as examples, even though I tested that before (I mean browsing an old version of a website and then counting the current version of the website as structural changes).

Let's say you are scraping a page with a structure like this:
```html
<div class="container">
@@ -301,31 +438,137 @@ When website owners implement structural changes like
The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.

```python
# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)
if not element:  # One day the website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of the code...
```
> How does the auto-matching work? Check the [FAQs](#-enlightening-questions-and-faqs) section for that and other possible issues while auto-matching.
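The rough intuition (a deliberately simplified sketch, not Scrapling's actual algorithm - see the FAQs for the real mechanics): when saving, record the element's unique properties (tag, attributes, text); when relocating, score every candidate against that record and pick the closest match.

```python
from difflib import SequenceMatcher

def similarity(saved: dict, candidate: dict) -> float:
    """Score a candidate element against saved unique properties (0..1)."""
    tag_score = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    text_score = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    saved_attrs = set(saved["attrs"].items())
    cand_attrs = set(candidate["attrs"].items())
    attr_score = len(saved_attrs & cand_attrs) / max(len(saved_attrs | cand_attrs), 1)
    return (tag_score + text_score + attr_score) / 3

# The saved element's id was renamed, but its tag and text still identify it.
saved = {"tag": "p", "attrs": {"id": "p1"}, "text": "Price: $10"}
candidates = [
    {"tag": "p", "attrs": {"id": "price", "class": "new"}, "text": "Price: $10"},
    {"tag": "span", "attrs": {}, "text": "unrelated"},
]
best = max(candidates, key=lambda c: similarity(saved, c))
print(best["attrs"])  # the renamed price element wins
```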
**Notes:**

1. If you pass the `auto_save` argument without enabling `auto_match` on initialization, `auto_save` will be ignored with a warning message like this:
```text
Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.
```
This behavior is purely for performance reasons, so the database gets created/connected only when you are planning to use the auto-matching features. The same applies to the `auto_match` argument.
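That lazy-singleton pattern can be sketched like this (illustrative only - the class name, schema, and methods here are invented for the sketch, and Scrapling's real storage class differs): the database connection is only created on first real use.

```python
import sqlite3

class ElementStorage:
    """Lazily-connected singleton: the database is touched only on first use."""
    _instance = None

    def __new__(cls, db_path=":memory:"):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.db_path = db_path
            cls._instance._conn = None  # not connected yet
        return cls._instance

    @property
    def conn(self):
        if self._conn is None:  # connect on first real use
            self._conn = sqlite3.connect(self.db_path)
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS elements (selector TEXT PRIMARY KEY, props TEXT)"
            )
        return self._conn

    def save(self, selector, props):
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO elements VALUES (?, ?)", (selector, props)
            )

    def load(self, selector):
        row = self.conn.execute(
            "SELECT props FROM elements WHERE selector = ?", (selector,)
        ).fetchone()
        return row[0] if row else None

storage = ElementStorage()
assert storage is ElementStorage()  # the same instance everywhere
```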

2. You can't do something like
```python
page.css('body').css('#p1', auto_match=True)
```
because you can't auto-match a whole list; you have to be specific and do something like
```python
page.css_first('body').css('#p1', auto_match=True)
```
### Is That All?
Here's what else you can do with Scrapling:
@@ -355,12 +598,37 @@ Here's what else you can do with Scrapling:
[<Element a at 0x105a2a7b0>]
```
- Doing operations on element content is the same as in Scrapy
```python
quote.re(r'regex_pattern')  # Get all strings that match the regex pattern
quote.re_first(r'regex_pattern')  # Get the first string that matches the regex pattern
quote.json()  # If the content text is jsonable, then convert it to json using `orjson` which is 10x faster than the standard json library and provides more options
```
In fact, all of these are methods of the `TextHandler` that holds the text content, so you can do the same directly if you call the `.text` property or the equivalent selector function.
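Conceptually, a `TextHandler` can be pictured as a `str` subclass with the extra methods bolted on (a hypothetical sketch for intuition, not the real class):

```python
import re

class TextHandler(str):
    """A str subclass: every normal string operation still works on it."""

    def re(self, pattern):
        """Return all regex matches found in the text."""
        return [TextHandler(m) for m in re.findall(pattern, self)]

    def re_first(self, pattern, default=None):
        """Return the first regex match, or `default` if there is none."""
        matches = self.re(pattern)
        return matches[0] if matches else default

    def clean(self):
        """Collapse consecutive whitespace into single spaces."""
        return TextHandler(re.sub(r"\s+", " ", self).strip())

text = TextHandler("  Price:   $10  ")
print(text.clean())             # 'Price: $10'
print(text.re_first(r"\$\d+"))  # '$10'
print(text.upper())             # plain str methods still work
```

Because the class subclasses `str`, it can be dropped anywhere a plain string is expected.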
|
|
|
|
- Sort all characters in the string as if it were a list and return the new string
```python
quote.sort()
```
|
|

Note that implementing your own storage system can be complex, as there are some strict rules such as inheriting from the same abstract class, following the singleton design pattern used in other classes, and more. So make sure to read the docs first.

## ⚡ Enlightening Questions and FAQs
This section addresses common questions about Scrapling; please read it before opening an issue.
@@ -460,6 +729,7 @@ Everybody is invited and welcome to contribute to Scrapling. There is a lot to d

Please read the [contributing file](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before doing anything.

## Disclaimer for Scrapling Project

> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or within their allowed rules like the `robots.txt` file, for example.

## License
@@ -470,11 +740,11 @@ This project includes code adapted from:

- Parsel (BSD License) - Used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/translator.py) submodule

## Thanks and References

- [brotector](https://github.com/kaliiiiiiiiii/brotector)
- [fakebrowser](https://github.com/kkoooqq/fakebrowser)
- [rebrowser-patches](https://github.com/rebrowser/rebrowser-patches)
- [Vinyzu](https://github.com/Vinyzu)'s work on Playwright's mock on [Botright](https://github.com/Vinyzu/Botright)
- [Daijro](https://github.com/daijro)'s brilliant work on both [BrowserForge](https://github.com/daijro/browserforge) and [Camoufox](https://github.com/daijro/camoufox)

## Known Issues
- In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return only the first element to you when you relocate it later. This doesn't include combined CSS selectors (using commas to combine more than one selector, for example) as these selectors get separated and each selector gets executed alone.
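That separation step can be pictured as follows (a naive sketch of the idea - it ignores commas inside quotes or functional pseudo-classes like `:is()`):

```python
def split_combined_selector(selector: str) -> list:
    """Split a comma-combined CSS selector into individual selectors."""
    return [part.strip() for part in selector.split(",") if part.strip()]

parts = split_combined_selector(".product , #sidebar a, div.price")
print(parts)  # ['.product', '#sidebar a', 'div.price']
# Each part would then be executed (and auto-saved) on its own.
```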

# 🕷️ Scrapling: Undetectable, Lightning-Fast, Adaptive Web Scraping for Python
[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)

Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. Whether you're a beginner or an expert, Scrapling provides powerful features while maintaining simplicity.

```python
>> from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
# Fetch websites' source under the radar!
>> fetcher = StealthyFetcher().fetch('https://example.com', headless=True, disable_resources=True)
>> print(fetcher.status)
200
>> page = fetcher.adaptor
>> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True)  # and Scrapling still finds them!
```

## Table of Contents
* [Key Features](#key-features)
  * [Fetch websites as you prefer](#fetch-websites-as-you-prefer)
  * [Adaptive Scraping](#adaptive-scraping)
  * [Performance](#performance)
  * [Developing Experience](#developing-experience)
* [Getting Started](#getting-started)
* [Parsing Performance](#parsing-performance)
  * [Text Extraction Speed Test (5000 nested elements)](#text-extraction-speed-test-5000-nested-elements)
  * [Extraction By Text Speed Test](#extraction-by-text-speed-test)
* [Installation](#installation)
* [Fetching Websites Features](#fetching-websites-features)
  * [Fetcher](#fetcher)
  * [StealthyFetcher](#stealthyfetcher)
  * [PlayWrightFetcher](#playwrightfetcher)
* [Advanced Parsing Features](#advanced-parsing-features)
  * [Smart Navigation](#smart-navigation)
  * [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
  * [Handling Structural Changes](#handling-structural-changes)
    * [Real World Scenario](#real-world-scenario)
  * [Find elements by filters](#find-elements-by-filters)
  * [Is That All?](#is-that-all)
* [More Advanced Usage](#more-advanced-usage)
* [⚡ Enlightening Questions and FAQs](#-enlightening-questions-and-faqs)
  * [How does auto-matching work?](#how-does-auto-matching-work)
  * [How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?](#how-does-the-auto-matching-work-if-i-didnt-pass-a-url-while-initializing-the-adaptor-object)
  * [If all things about an element can change or get removed, what are the unique properties to be saved?](#if-all-things-about-an-element-can-change-or-get-removed-what-are-the-unique-properties-to-be-saved)
  * [I have enabled the `auto_save`/`auto_match` parameter while selecting and it got completely ignored with a warning message](#i-have-enabled-the-auto_saveauto_match-parameter-while-selecting-and-it-got-completely-ignored-with-a-warning-message)
  * [I have done everything as the docs but the auto-matching didn't return anything, what's wrong?](#i-have-done-everything-as-the-docs-but-the-auto-matching-didnt-return-anything-whats-wrong)
  * [Can Scrapling replace code built on top of BeautifulSoup4?](#can-scrapling-replace-code-built-on-top-of-beautifulsoup4)
  * [Can Scrapling replace code built on top of AutoScraper?](#can-scrapling-replace-code-built-on-top-of-autoscraper)
  * [Is Scrapling thread-safe?](#is-scrapling-thread-safe)
* [Sponsors](#sponsors)
* [Contributing](#contributing)
* [Disclaimer for Scrapling Project](#disclaimer-for-scrapling-project)
* [License](#license)
* [Acknowledgments](#acknowledgments)
* [Thanks and References](#thanks-and-references)
* [Known Issues](#known-issues)

## Key Features

### Fetch websites as you prefer
- **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher`
- **Stealthy fetcher**: Annoying anti-bot protection? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` in its default configuration!
- **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - all are possible with `PlayWrightFetcher`!

### Adaptive Scraping
- 🔄 **Smart Element Tracking**: Locate previously identified elements after website structure changes, using an intelligent similarity system and integrated storage.
- 🎯 **Flexible Querying**: Use CSS selectors, XPath, Elements filters, text search, or regex - chain them however you want!
- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you want on the page (Ex: other products like the product you found on the page).
- 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors, using Scrapling's powerful features.

### Performance
- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries (outperforming BeautifulSoup in parsing by up to 620x in our tests).
- 🔋 **Memory Efficient**: Optimized data structures for minimal memory footprint.
- ⚡ **Fast JSON serialization**: 10x faster JSON serialization than the standard json library with more options.

### Developing Experience
- 🛠️ **Powerful Navigation API**: Traverse the DOM tree easily in all directions and get the info you want (parent, ancestors, sibling, children, next/previous element, and more).
- 🧬 **Rich Text Processing**: All strings have built-in methods for regex matching, cleaning, and more. All elements' attributes are read-only dictionaries that are faster than standard dictionaries with added methods.
- 📝 **Automatic Selector Generation**: Create robust CSS/XPath selectors for any element.
- 🔌 **API Similar to Scrapy/BeautifulSoup**: Familiar methods and similar pseudo-elements for Scrapy and BeautifulSoup users.
- 📘 **Type hints and test coverage**: Complete type coverage and almost full test coverage for better IDE support and fewer bugs, respectively.

## Getting Started

```python
from scrapling import Fetcher

fetcher = Fetcher(auto_match=False)

# Fetch a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True).adaptor
# Get all strings in the full page
page.get_all_text(ignore_tags=('script', 'style'))

quotes = page.css('.quote .text::text') # CSS selector
quotes = page.xpath('//span[@class="text"]/text()')  # XPath
quotes = page.css('.quote').css('.text::text')  # Chained selectors
quotes = [element.text for element in page.css('.quote .text')]  # Slower than the bulk query above

# Get the first quote element
quote = page.css_first('.quote')  # same as page.css('.quote').first or page.css('.quote')[0]

# Tired of selectors? Use find_all/find
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...

# Working with elements
quote.html_content  # Inner HTML
quote.prettify()  # Prettified version of Inner HTML
quote.attrib  # Element attributes
quote.path  # DOM path to element (List)
```
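Under the hood, `find_all`-style filtering boils down to matching each element's tag and attributes against the filters you pass. A simplified, hypothetical sketch of that matching rule (dicts stand in for parsed elements; this is not Scrapling's implementation):

```python
def matches(element: dict, tags=None, **attr_filters) -> bool:
    """Return True only if the element satisfies every given filter."""
    if tags and element["tag"] not in tags:
        return False
    for name, value in attr_filters.items():
        name = name.rstrip("_")  # allow class_ like BeautifulSoup's keyword style
        if element["attrs"].get(name) != value:
            return False
    return True

elements = [
    {"tag": "div", "attrs": {"class": "quote"}},
    {"tag": "span", "attrs": {"class": "quote"}},
    {"tag": "div", "attrs": {"class": "author"}},
]
quotes = [e for e in elements if matches(e, tags=["div"], class_="quote")]
print(len(quotes))  # 1
```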
To keep it simple, all methods can be chained on top of each other!

## Parsing Performance
Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents.
Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

```
python -m browserforge update
```

## Fetching Websites Features

All fetcher-type classes are imported in the same way:
```python
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
```
All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug`, which are the same ones you give to the `Adaptor` class.
> [!NOTE]
> The `auto_match` argument is enabled by default; it is the one you should care about the most, as you will see later.

### Fetcher
This class is built on top of [httpx](https://www.python-httpx.org/) with some added flavors; here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.

For all methods, you have `stealthy_headers`, which makes `Fetcher` create and use real browser headers, then set a referer header as if the request came from a Google search for this URL's domain. It's enabled by default.
```python
>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
```
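The referer trick described above amounts to deriving a Google-search URL from the target's domain. A rough sketch of such a header builder (illustrative only, not Scrapling's actual code):

```python
from urllib.parse import urlparse, quote_plus

def google_referer(url: str) -> str:
    """Build a referer as if the visit came from a Google search for the domain."""
    domain = urlparse(url).netloc
    return f"https://www.google.com/search?q={quote_plus(domain)}"

print(google_referer("https://quotes.toscrape.com/page/2/"))
# https://www.google.com/search?q=quotes.toscrape.com
```

Many sites treat search-engine referrals as ordinary organic traffic, which is why this header helps requests blend in.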

### StealthyFetcher
This class is built on top of [Camoufox](https://github.com/daijro/camoufox), which bypasses most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
```python
>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
>> page.status == 200
True
```
<details><summary><strong>Expand this for the complete list of arguments</strong></summary>

| Argument | Description | Optional |
|:---:|---|:---:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden mode (**default**), `virtual` to run it in virtual-screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | ✔️ |
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | ✔️ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if both are used._ | ✔️ |
| block_webrtc | Blocks WebRTC entirely. | ✔️ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
| humanize | Humanize the cursor movement. Takes either `True` or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds used in all operations and waits through the page. The default is 30000. | ✔️ |
| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
| wait_selector_state | The state to wait for with the selector given in `wait_selector`. _The default state is `attached`._ | ✔️ |

</details>
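The `page_action` argument from the table above is just a callable that receives the page, performs actions, and returns it. A hypothetical callback (the `#accept-cookies` and `.content` selectors are made up, and the `click`/`wait_for_selector` calls assume Playwright-style page methods):

```python
def accept_cookies(page):
    """Example page_action: dismiss a consent banner before scraping."""
    page.click("#accept-cookies")        # assumed banner button selector
    page.wait_for_selector(".content")   # wait until the real content is ready
    return page                          # page_action must return the page

# Usage sketch:
# StealthyFetcher().fetch('https://example.com', page_action=accept_cookies)
```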

This list isn't final, so expect a lot more additions and flexibility in the next versions!

### PlayWrightFetcher
This class is built on top of [Playwright](https://playwright.dev/python/), which currently provides 4 main run options, but they can be mixed together as you want.
```python
>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>> page.adaptor.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
```
Using this Fetcher class, you can make requests with:
1) Vanilla Playwright without any modifications other than the ones you chose.
2) Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/).<br/> Some of the things this fetcher's stealth mode does include:
    * Patching the CDP runtime fingerprint.
    * Mimicking some real browsers' properties by injecting several JS files and using custom options.
    * Using custom flags on launch to hide Playwright even more and make it faster.
    * Generating real browser headers of the same type and same user OS, then appending them to the request's headers.
3) Real browsers, by passing the CDP URL of your browser to be controlled by the Fetcher; most of the options can be enabled with it.
4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option, by passing the CDP URL and enabling the `nstbrowser_mode` option.

Add to that a lot of controlling/hiding options, as you will see in the arguments list below.

<details><summary><strong>Expand this for the complete list of arguments</strong></summary>
|
| 266 |
+
|
| 267 |
+
| Argument | Description | Optional |
|
| 268 |
+
|:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
|
| 269 |
+
| url | Target url | ❌ |
|
| 270 |
+
| headless | Pass `True` to run the browser in headless/hidden (**default**), or `False` for headful/visible mode. | ✔️ |
|
| 271 |
+
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|
| 272 |
+
| useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | ✔️ |
|
| 273 |
+
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|
| 274 |
+
| timeout | The timeout in milliseconds that's used in all operations and waits through the page. Default is 30000. | ✔️ |
|
| 275 |
+
| page_action | Added for automation. A function that takes the `page` object, do the automation you need, then return `page` again. | ✔️ |
|
| 276 |
+
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 277 |
+
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 278 |
+
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
|
| 279 |
+
| extra_headers | A dictionary of extra headers to add with the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
|
| 280 |
+
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
|
| 281 |
+
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
|
| 282 |
+
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
|
| 283 |
+
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
|
| 284 |
+
| nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | ✔️ |
|
| 285 |
+
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
|
| 286 |
+
|
| 287 |
+
</details>

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!

## Advanced Parsing Features

### Smart Navigation
```python
>>> quote.tag
'div'

>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

>>> quote.next # gets the next element, the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>

>>> quote.children.css_first(".author::text")
'Albert Einstein'

>>> quote.has_class('quote')
True

>>> quote.css_selector
'body > div > div:nth-of-type(2) > div > div'

# Test these selectors on your favorite browser or reuse them again in the library's methods!
>>> quote.xpath_selector
'//body/div/div[2]/div/div'
```

### Content-based Selection & Finding Similar Elements
You can select elements by their text content in multiple ways. Here's a full example on another website:
```python
>>> page = Fetcher().get('https://books.toscrape.com/index.html').adaptor

>>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
```
The [documentation](https://github.com/D4Vinci/Scrapling/tree/main/docs/Examples) will provide more advanced examples.

### Handling Structural Changes
Let's say you are scraping a page with a structure like this:
```html
<div class="container">
```

The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.

```python
from scrapling import Adaptor
# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)

if not element:  # One day the website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of the code...
```

> How does the auto-matching work? Check the [FAQs](#-enlightening-questions-and-faqs) section for that and other possible issues while auto-matching.
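To build some intuition before reading the FAQs, here's a toy sketch of the general idea behind relocating an element by similarity: save a fingerprint of the element's unique properties, then later pick the candidate that shares the most of them. This is my simplified illustration only, not Scrapling's actual algorithm or data model:

```python
# Toy illustration of similarity-based element relocation.
# NOT Scrapling's actual algorithm - just the general idea:
# save identifying properties of an element, then score candidates
# by how many saved properties they still share.

def fingerprint(element: dict) -> set:
    """Flatten an element's identifying properties into a set."""
    props = {('tag', element['tag'])}
    props.update(('attr', k, v) for k, v in element.get('attrs', {}).items())
    return props

def relocate(saved: set, candidates: list) -> dict:
    """Return the candidate element most similar to the saved fingerprint."""
    return max(candidates, key=lambda el: len(saved & fingerprint(el)))

# The element as it looked before the redesign
old = {'tag': 'p', 'attrs': {'id': 'p1', 'class': 'story'}}
saved = fingerprint(old)

# After the redesign: the id is gone, but the tag and class survived
new_page = [
    {'tag': 'div', 'attrs': {'class': 'sidebar'}},
    {'tag': 'p', 'attrs': {'class': 'story'}},  # best match
]
print(relocate(saved, new_page))  # → {'tag': 'p', 'attrs': {'class': 'story'}}
```

The real library tracks far more signals than this, but the principle is the same: partial matches beat a hard-coded selector when the page shifts.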

#### Real World Scenario
Let's use a real website as an example and use one of the fetchers to fetch its source. To do this, we would need to find a website that will change its design/structure soon, take a copy of its source, then wait for the change. Of course, that's nearly impossible to know in advance unless I know the website's owner, and that would make it a staged test.

To solve this, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverflow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/), pretty old, huh?<br/>Let's test if the auto-match feature can extract the same button from both the old 2010 design and the current design using the same selector :)

If I want to extract the Questions button from the old design, I can use a selector like `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is overly specific because it was generated by Google Chrome.

Now let's test the same selector in both versions:

```python
>> from scrapling import Fetcher
>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>> new_url = "https://stackoverflow.com/"
>>
>> page = Fetcher(automatch_domain='stackoverflow.com').get(old_url, timeout=30).adaptor
>> element1 = page.css_first(selector, auto_save=True)
>>
>> # Same selector but used in the updated website
>> page = Fetcher(automatch_domain="stackoverflow.com").get(new_url).adaptor
>> element2 = page.css_first(selector, auto_match=True)
>>
>> if element1.text == element2.text:
...    print('Scrapling found the same element in the old design and the new design!')
'Scrapling found the same element in the old design and the new design!'
```

Note that I used a new argument called `automatch_domain`. This is because, to Scrapling, these are two different URLs, not the same website, so it isolates their auto-match data. To tell Scrapling they are the same website, we pass the domain we want to use for saving auto-match data for both of them, so Scrapling doesn't isolate them.

In a real-world scenario, the code will be the same except it will use the same URL for both requests, so you won't need the `automatch_domain` argument. This is the closest example I can give to real-world cases, so I hope it didn't confuse you :)

**Notes:**
1. For the two examples above, I used the `Adaptor` class once and a `Fetcher` class once to show that you can create the `Adaptor` object yourself if you already have the source, or fetch the source using any `Fetcher` class, which then creates the `Adaptor` object for you on the `.adaptor` property.
2. Passing the `auto_save` argument with the `auto_match` argument set to `False` while initializing the Adaptor/Fetcher object will make Scrapling ignore the `auto_save` argument value and log the following warning message:
```text
Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.
```
This behavior is purely for performance reasons, so the database gets created/connected only when you are planning to use the auto-matching features. The same applies to the `auto_match` argument.

3. The `auto_match` parameter works only for `Adaptor` instances, not `Adaptors`, so if you do something like this you will get an error:
```python
page.css('body').css('#p1', auto_match=True)
```
because you can't auto-match a whole list; you have to be specific and do something like:
```python
page.css_first('body').css('#p1', auto_match=True)
```

### Find elements by filters
Inspired by BeautifulSoup's `find_all` function, you can find elements with the `find_all`/`find` methods. Both methods can take multiple types of filters and return all elements in the page that match all of them.

* To be more specific:
  * Any string passed is considered a tag name.
  * Any iterable passed (List/Tuple/Set) is considered an iterable of tag names.
  * Any dictionary is considered a mapping of HTML element attribute names to attribute values.
  * Any regex patterns passed are used as filters.
  * Any functions passed are used as filters.
  * Any keyword argument passed is considered an HTML element attribute with its value.

So, after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
<br/>It filters all elements in the current page/element in the following order:

1. All elements with the passed tag name(s).
2. All elements that match all passed attribute(s).
3. All elements that match all passed regex patterns.
4. All elements that fulfill all passed function(s).

Note: The filtering process always starts from the first filter it finds in the filtering order above, so if no tag name(s) are passed but attributes are, the process starts from that layer, and so on. **But the order in which you pass the arguments doesn't matter.**
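The waterfall order described above can be sketched in plain Python. This is a simplified illustration over mock elements, not Scrapling's internals:

```python
import re

# Mock elements: (tag, attributes, text)
elements = [
    ('div', {'class': 'quote'}, 'The world as we know it'),
    ('span', {'class': 'text'}, 'Hello world'),
    ('div', {'class': 'tags'}, 'Tags: change'),
]

def find_all(elements, tags=None, attrs=None, patterns=None, functions=None):
    """Each filter layer runs only on the survivors of the previous layer."""
    results = elements
    if tags:       # 1. tag name(s)
        results = [e for e in results if e[0] in tags]
    if attrs:      # 2. attributes
        results = [e for e in results if all(e[1].get(k) == v for k, v in attrs.items())]
    if patterns:   # 3. regex patterns
        results = [e for e in results if all(p.search(e[2]) for p in patterns)]
    if functions:  # 4. function filters
        results = [e for e in results if all(f(e) for f in functions)]
    return results

print(find_all(elements, tags={'div'}, patterns=[re.compile('world')]))
# → [('div', {'class': 'quote'}, 'The world as we know it')]
```

Note how the regex layer only ever sees the two `div` elements, never the `span`; that is the waterfall.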

Examples to clear any confusion :)

```python
>> from scrapling import Fetcher
>> import re  # needed for the regex example below
>> page = Fetcher().get('https://quotes.toscrape.com/').adaptor
# Find all elements with tag name `div`.
>> page.find_all('div')
[<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
 <data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
...]

# Find all div elements with a class that equals `quote`.
>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Same as above.
>> page.find_all('div', {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all elements with a class that equals `quote`.
>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all div elements with a class that equals `quote` and an element `.text` that contains the word 'world'.
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]

# Find all elements that contain the word 'world' in their content.
>> page.find_all(lambda e: "world" in e.text)
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
 <data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]

# Find all span elements that match the given regex
>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]

# Find all div and span elements with class 'quote' (there are no such span elements, so only divs are returned)
>> page.find_all(['div', 'span'], {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Mix things up
>> page.find_all({'itemtype': "http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein',
 'J.K. Rowling',
...]
```

### Is That All?
Here's what else you can do with Scrapling:

- Filtering results based on a function
```python
# Find all products over $50
expensive_products = page.css('.product_pod').filter(
    lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
)
```

- Searching results for the first one that matches a function
```python
# Find the first product with price '54.23'
page.css('.product_pod').search(
    lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
)
```

- Doing operations on element content is the same as Scrapy
```python
quote.re(r'regex_pattern')  # Get all strings (TextHandlers) that match the regex pattern
quote.re_first(r'regex_pattern')  # Get the first string (TextHandler) only
quote.json()  # If the content text is jsonable, convert it to json using `orjson`, which is ~10x faster than the standard json library and provides more options
```
except that you can do more with them, like:
```python
quote.re(
    r'regex_pattern',
    replace_entities=True,   # Character entity references are replaced by their corresponding character
    clean_match=True,        # This will ignore all whitespaces and consecutive spaces while matching
    case_sensitive=False,    # Set the regex to ignore letter case while compiling it
)
```
All of these methods are actually methods of the `TextHandler` that holds the text content, so the same can be done directly if you call the `.text` property or an equivalent selector function.
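As a rough mental model of what `clean_match` and `case_sensitive` do, here is an analogue using plain `re` (my illustration, not the library's implementation): collapse whitespace before matching, and drop the pattern's case sensitivity:

```python
import re

def re_first_like(text, pattern, clean_match=False, case_sensitive=True):
    """Rough analogue: normalize whitespace, optionally ignore case, return first match."""
    if clean_match:
        # Collapse all whitespace runs (newlines, tabs, repeats) into single spaces
        text = ' '.join(text.split())
    flags = 0 if case_sensitive else re.IGNORECASE
    match = re.search(pattern, text, flags)
    return match.group(0) if match else None

print(re_first_like('Albert\n   EINSTEIN', r'albert einstein',
                    clean_match=True, case_sensitive=False))
# → 'Albert EINSTEIN'
```

Without `clean_match`, the newline and extra spaces between the two words would make this literal pattern fail.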

- For example, extracting embedded JSON from a script tag:
```python
page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
```

- Sort all characters in the string as if it were a list and return the new string
```python
quote.sort(reverse=False)
```

> To be clear, `TextHandler` is a sub-class of Python's `str`, so all normal operations/methods that work with Python strings will work with it.
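This is the classic "subclass `str`" pattern: every string method keeps working, while extra methods can be layered on top. A minimal sketch of the pattern (illustrative only, not Scrapling's actual class):

```python
import json

class MiniTextHandler(str):
    """A tiny str subclass in the spirit of TextHandler (illustration only)."""

    def clean(self):
        # Collapse whitespace runs; return the same type so chaining keeps working
        return MiniTextHandler(' '.join(self.split()))

    def json(self):
        # Parse the string content as JSON
        return json.loads(self)

text = MiniTextHandler('  {"price":   54.23}  ')
print(text.upper())         # plain str methods still work
print(text.clean())         # → '{"price": 54.23}'
print(text.clean().json())  # → {'price': 54.23}
```

Because `clean()` returns the subclass rather than a bare `str`, chained calls like `.clean().json()` stay available, which is the same idea that makes chaining pleasant in the library.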

Note that implementing your own storage system can be complex, as there are some strict rules, such as inheriting from the same abstract class, following the singleton design pattern used in the other classes, and more. So make sure to read the docs first.

Detailed documentation for the library will need a website. I'm trying to rush creating the website, researching new ideas, and adding more features/tests/benchmarks, but time is tight with too many spinning plates between work, personal life, and working on Scrapling. You can help by using the [sponsor button](https://github.com/sponsors/D4Vinci) above :)

## ⚡ Enlightening Questions and FAQs
This section addresses common questions about Scrapling. Please read it before opening an issue.

Please read the [contributing file](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before doing anything.

## Disclaimer for Scrapling Project
> [!CAUTION]
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or are within their allowed rules, like the `robots.txt` file, for example.

## License
- Parsel (BSD License) - Used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/translator.py) submodule

## Thanks and References
- [Daijro](https://github.com/daijro)'s brilliant work on both [BrowserForge](https://github.com/daijro/browserforge) and [Camoufox](https://github.com/daijro/camoufox)
- [Vinyzu](https://github.com/Vinyzu)'s work on Playwright's mock in [Botright](https://github.com/Vinyzu/Botright)
- [brotector](https://github.com/kaliiiiiiiiii/brotector)
- [fakebrowser](https://github.com/kkoooqq/fakebrowser)
- [rebrowser-patches](https://github.com/rebrowser/rebrowser-patches)

## Known Issues
- In the auto-matching save process, only the unique properties of the first element from the selection results get saved. So if your selector matches different elements in different locations on the page, auto-matching will probably return only the first element when you relocate it later. This doesn't apply to combined CSS selectors (using commas to combine more than one selector, for example), as those selectors get separated and each selector is executed alone.