Karim shoair committed on
Commit ·
7f474cb
1
Parent(s): 3df2133
docs: Update the 'main classes' page
- docs/parsing/main_classes.md +92 -72
docs/parsing/main_classes.md
CHANGED
|
@@ -1,41 +1,41 @@
|
|
| 1 |
## Introduction
|
| 2 |
-
After exploring the various ways to select elements with Scrapling and related features,
|
| 3 |
|
| 4 |
-
The [
|
| 5 |
```python
|
| 6 |
-
from scrapling import
|
| 7 |
-
from scrapling.parser import
|
| 8 |
```
|
| 9 |
-
|
| 10 |
```python
|
| 11 |
-
|
| 12 |
-
|
| 13 |
url='https://example.com'
|
| 14 |
)
|
| 15 |
|
| 16 |
# Then select elements as you like
|
| 17 |
-
elements =
|
| 18 |
```
|
| 19 |
-
In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course,
|
| 20 |
|
| 21 |
-
In other words, the main page is a [
|
| 22 |
|
| 23 |
-
##
|
| 24 |
### Arguments explained
|
| 25 |
-
The most important
|
| 26 |
|
| 27 |
-
Otherwise, you have the arguments `url`, `
|
| 28 |
|
| 29 |
-
Then you have the arguments for
|
| 30 |
|
| 31 |
- **encoding**: This is the encoding that will be used while parsing the HTML. The default is `UTF-8`.
|
| 32 |
-
- **keep_comments**: This tells the library whether to keep HTML comments while parsing the page. It's disabled by default
|
| 33 |
-
- **keep_cdata**: Same logic as the HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML.
|
| 34 |
|
| 35 |
I have intentionally ignored the arguments `huge_tree` and `root` to avoid making this page more complicated than needed.
|
| 36 |
-
You may notice that I'm doing that a lot
|
| 37 |
|
| 38 |
-
After that, for the main page and elements within, most properties don't get initialized until you use
|
| 39 |
|
| 40 |
### Properties
|
| 41 |
You have already seen much of this on the [overview](../overview.md) page, but don't worry if you didn't. We will review it more thoroughly using more advanced methods/usages. For clarity, the properties for traversal are separated below in the [traversal](#traversal) section.
|
|
@@ -81,15 +81,15 @@ Let's say we are parsing this HTML page for simplicity:
|
|
| 81 |
```
|
| 82 |
Load the page directly as shown before:
|
| 83 |
```python
|
| 84 |
-
from scrapling import
|
| 85 |
-
page =
|
| 86 |
```
|
| 87 |
Get all text content on the page recursively
|
| 88 |
```python
|
| 89 |
>>> page.get_all_text()
|
| 90 |
'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
|
| 91 |
```
|
| 92 |
-
Get the first article as explained before; we will use it as an example
|
| 93 |
```python
|
| 94 |
article = page.find('article')
|
| 95 |
```
|
|
@@ -98,7 +98,7 @@ With the same logic, get all text content on the element recursively
|
|
| 98 |
>>> article.get_all_text()
|
| 99 |
'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
|
| 100 |
```
|
| 101 |
-
But if you try to get the direct text content, it will be empty
|
| 102 |
```python
|
| 103 |
>>> article.text
|
| 104 |
''
|
|
@@ -107,10 +107,10 @@ The `get_all_text` method has the following optional arguments:
|
|
| 107 |
|
| 108 |
1. **separator**: All strings collected will be concatenated using this separator. The default is '\n'
|
| 109 |
2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
|
| 110 |
-
3. **ignore_tags**: A tuple of all tag names you want to ignore in the final results. The default is `('script', 'style',)`.
|
| 111 |
4. **valid_values**: If enabled, the method will only collect elements with real values, so all elements with empty text content or only whitespaces will be ignored. It's enabled by default
|
| 112 |
|
| 113 |
-
By the way, the text returned here is not a standard string but a [TextHandler](#texthandler); we will get to this in detail later, so if the text content can be serialized to JSON,
|
| 114 |
```python
|
| 115 |
>>> script = page.find('script')
|
| 116 |
>>> script.json()
|
|
@@ -121,7 +121,7 @@ Let's continue to get the element tag
|
|
| 121 |
>>> article.tag
|
| 122 |
'article'
|
| 123 |
```
|
| 124 |
-
If you
|
| 125 |
```python
|
| 126 |
>>> page.tag
|
| 127 |
'html'
|
|
@@ -133,6 +133,17 @@ Getting the attributes of the element
|
|
| 133 |
>>> print(article.attrib)
|
| 134 |
{'class': 'product', 'data-id': '1'}
|
| 135 |
```
|
| 136 |
Get the HTML content of the element
|
| 137 |
```python
|
| 138 |
>>> article.html_content
|
|
@@ -143,7 +154,7 @@ It's the same if you used the `.body` property
|
|
| 143 |
>>> article.body
|
| 144 |
'<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
|
| 145 |
```
|
| 146 |
-
Get the prettified version of the HTML content
|
| 147 |
```python
|
| 148 |
>>> print(article.prettify())
|
| 149 |
<article class="product" data-id="1"><h3>Product 1</h3>
|
|
@@ -175,12 +186,12 @@ Same case with XPath
|
|
| 175 |
```
|
| 176 |
|
| 177 |
### Traversal
|
| 178 |
-
Using the elements we found above, we will go over the properties/methods for moving
|
| 179 |
|
| 180 |
If you are unfamiliar with the DOM tree or the tree data structure in general, the following traversal part can be confusing. I recommend you look up these concepts online for a better understanding.
|
| 181 |
|
| 182 |
If you are too lazy to search about it, here's a quick explanation to give you a good idea.<br/>
|
| 183 |
-
|
| 184 |
This element will be directly above elements like `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent." The element `body` is a "sibling" of the element `head` and vice versa.
|
| 185 |
|
| 186 |
Accessing the parent of an element
|
|
@@ -238,7 +249,7 @@ Get the siblings of an element
|
|
| 238 |
```
|
| 239 |
Get the next element of the current element
|
| 240 |
```python
|
| 241 |
-
>>> article.next
|
| 242 |
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
|
| 243 |
```
|
| 244 |
The same logic applies to the `previous` property
|
|
@@ -258,7 +269,7 @@ If your case needs more than the element's parent, you can iterate over the whol
|
|
| 258 |
for ancestor in article.iterancestors():
|
| 259 |
# do something with it...
|
| 260 |
```
|
| 261 |
-
You can search for a specific ancestor of an element that satisfies a function; all you need to do is to pass a function that takes
|
| 262 |
```python
|
| 263 |
>>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
|
| 264 |
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
|
|
@@ -266,10 +277,10 @@ You can search for a specific ancestor of an element that satisfies a function;
|
|
| 266 |
>>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list')) # Same result, different approach
|
| 267 |
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
|
| 268 |
```
|
| 269 |
-
##
|
| 270 |
-
The class `
|
| 271 |
|
| 272 |
-
In the [
|
| 273 |
|
| 274 |
- If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
|
| 275 |
```python
|
|
@@ -284,18 +295,18 @@ In the [Adaptor](#adaptor) class, all methods/properties that should return a gr
|
|
| 284 |
```
|
| 285 |
- If you used a combined selector that returns mixed types, the result will be a Python standard `List`. <br/>Examples:
|
| 286 |
```python
|
| 287 |
-
>>> page.css('.price_color') # ->
|
| 288 |
>>> page.css('.product_pod a::attr(href)') # -> TextHandlers
|
| 289 |
>>> page.css('.price_color, .product_pod a::attr(href)') # -> List
|
| 290 |
```
|
| 291 |
|
| 292 |
-
Let's see what [
|
| 293 |
### Properties
|
| 294 |
Apart from the normal operations on Python lists like iteration, slicing, etc...
|
| 295 |
|
| 296 |
You can do the following:
|
| 297 |
|
| 298 |
-
Execute CSS and XPath selectors directly on the [
|
| 299 |
```python
|
| 300 |
>>> page.css('.product_pod a')
|
| 301 |
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
|
|
@@ -315,9 +326,9 @@ Execute CSS and XPath selectors directly on the [Adaptor](#adaptor) instances it
|
|
| 315 |
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
|
| 316 |
...]
|
| 317 |
```
|
| 318 |
-
Run the `re` and `re_first` methods directly. They take the same arguments passed
|
| 319 |
|
| 320 |
-
However, in this class, the `re_first` behaves differently as it runs `re` on each [
|
| 321 |
```python
|
| 322 |
>>> page.css('.price_color').re(r'[\d\.]+')
|
| 323 |
['51.77',
|
|
@@ -334,14 +345,14 @@ However, in this class, the `re_first` behaves differently as it runs `re` on ea
|
|
| 334 |
'sharp-objects_997',
|
| 335 |
...]
|
| 336 |
```
|
| 337 |
-
With the `search` method, you can search quickly in the available [
|
| 338 |
```python
|
| 339 |
# Find the first product with price '54.23'
|
| 340 |
>>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
|
| 341 |
>>> page.css('.product_pod').search(search_function)
|
| 342 |
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
|
| 343 |
```
|
| 344 |
-
You can use the `filter` method, too, which takes a function like the `search` method but returns an `
|
| 345 |
```python
|
| 346 |
# Find all products with prices over $50
|
| 347 |
>>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
|
|
@@ -351,26 +362,35 @@ You can use the `filter` method, too, which takes a function like the `search` m
|
|
| 351 |
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
|
| 352 |
...]
|
| 353 |
```
|
| 354 |
|
| 355 |
## TextHandler
|
| 356 |
This class is mandatory to understand, as all methods/properties that should return a string for you will return `TextHandler`, and the ones that should return a list of strings will return [TextHandlers](#texthandlers) instead.
|
| 357 |
|
| 358 |
-
TextHandler is a subclass of the standard Python string, so you can do anything with it. So, what is the difference that requires a different naming?
|
| 359 |
|
| 360 |
-
Of course, TextHandler provides extra methods and properties that
|
| 361 |
### Usage
|
| 362 |
-
First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a TextHandler again, so you can chain them as you want. If you find a method or property that returns a standard string instead of TextHandler, please open an issue, and we will override it as well.
|
| 363 |
|
| 364 |
-
First, we start with the `re` and `re_first` methods. These are the same methods that exist in the rest of the classes ([
|
| 365 |
|
| 366 |
-
|
| 367 |
|
| 368 |
Also, it takes other helpful arguments, which are:
|
| 369 |
|
| 370 |
- **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
|
| 374 |
You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
|
| 375 |
```python
|
| 376 |
>>> page.css('.price_color').re(r'[\d\.]+')
|
|
@@ -405,25 +425,25 @@ First, we start with the `re` and `re_first` methods. These are the same methods
|
|
| 405 |
>>> test_string.re('hi there', clean_match=True, case_sensitive=False)
|
| 406 |
['hi There']
|
| 407 |
```
|
| 408 |
-
Another use of the idea of replacing strings with `TextHandler` everywhere is a property like `html_content` returns `TextHandler` so you can do regex on the HTML content if you want:
|
| 409 |
```python
|
| 410 |
>>> page.html_content.re('div class=".*">(.*)</div')
|
| 411 |
['In stock: 5', 'In stock: 3', 'Out of stock']
|
| 412 |
```
|
| 413 |
|
| 414 |
-
- You also have the `.json()` method, which tries to convert the content to a
|
| 415 |
```python
|
| 416 |
>>> page.css_first('#page-data::text')
|
| 417 |
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
|
| 418 |
>>> page.css_first('#page-data::text').json()
|
| 419 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 420 |
```
|
| 421 |
-
Hence, if you didn't specify a text node while selecting an element (like the text content or an attribute text content), the text content will be selected automatically like this
|
| 422 |
```python
|
| 423 |
>>> page.css_first('#page-data').json()
|
| 424 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 425 |
```
|
| 426 |
-
The [
|
| 427 |
```html
|
| 428 |
<html>
|
| 429 |
<body>
|
|
@@ -438,42 +458,42 @@ First, we start with the `re` and `re_first` methods. These are the same methods
|
|
| 438 |
</body>
|
| 439 |
</html>
|
| 440 |
```
|
| 441 |
-
The [
|
| 442 |
So, as you know here, if you did something like this
|
| 443 |
```python
|
| 444 |
>>> page.css_first('div::text').json()
|
| 445 |
```
|
| 446 |
-
You will get an error because the `div` tag doesn't have direct text content that can be serialized to JSON; it actually doesn't have text content at all.<br/><br/>
|
| 447 |
In this case, the `get_all_text` method comes to the rescue, so you can do something like that
|
| 448 |
```python
|
| 449 |
>>> page.css_first('div').get_all_text(ignore_tags=[]).json()
|
| 450 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 451 |
```
|
| 452 |
I used the `ignore_tags` argument here because the default value of it is `('script', 'style',)`, as you are aware.<br/><br/>
|
| 453 |
-
Another related behavior
|
| 454 |
```python
|
| 455 |
-
>>> page =
|
| 456 |
```
|
| 457 |
-
Because the [
|
| 458 |
```python
|
| 459 |
>>> page.html_content
|
| 460 |
'<html><body><p>{"some_key": "some_value"}</p></body></html>'
|
| 461 |
```
|
| 462 |
-
Here, you can use `json` method directly, and it will work
|
| 463 |
```python
|
| 464 |
>>> page.json()
|
| 465 |
{'some_key': 'some_value'}
|
| 466 |
```
|
| 467 |
-
You might wonder how this happened while the `html` tag
|
| 468 |
-
Well, for
|
| 469 |
|
| 470 |
-
- Another handy method is `.clean()`,
|
| 471 |
```python
|
| 472 |
>>> TextHandler('\n wonderful idea, \reh?').clean()
|
| 473 |
'wonderful idea, eh?'
|
| 474 |
```
|
| 475 |
|
| 476 |
-
- Another method that might be helpful in some cases is the `.sort()` method to sort the string for you as you do with lists
|
| 477 |
```python
|
| 478 |
>>> TextHandler('acb').sort()
|
| 479 |
'abc'
|
|
@@ -487,19 +507,19 @@ Or do it in reverse:
|
|
| 487 |
Other methods and properties will be added over time, but remember that this class is returned in place of strings nearly everywhere in the library.
|
| 488 |
|
| 489 |
## TextHandlers
|
| 490 |
-
You probably guessed it: This class is similar to [
|
| 491 |
|
| 492 |
-
The only difference is that the `re_first` method logic here does `re` on each [TextHandler](#texthandler) within and returns the first result it has or `None`. Nothing is new to explain here, but new methods will be added
|
| 493 |
|
| 494 |
## AttributesHandler
|
| 495 |
-
This is a read-only version of Python's standard dictionary or `dict` that's only used to store the attributes of each element or each [
|
| 496 |
```python
|
| 497 |
>>> print(page.find('script').attrib)
|
| 498 |
{'id': 'page-data', 'type': 'application/json'}
|
| 499 |
>>> type(page.find('script').attrib).__name__
|
| 500 |
'AttributesHandler'
|
| 501 |
```
|
| 502 |
-
Because it's read-only, it will use fewer resources than the standard dictionary. Still, it has the same dictionary method
|
| 503 |
|
| 504 |
It currently adds two extra simple methods:
|
| 505 |
|
|
@@ -530,10 +550,10 @@ It currently adds two extra simple methods:
|
|
| 530 |
|
| 531 |
Hence, I used the `list` function here because `search_values` returns a generator, so it would be `True` for all elements.
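
The point about generators can be checked with plain Python. The sketch below uses a hypothetical `matching_values` generator as a stand-in (it is not the library's `search_values`); it only illustrates that a generator object is always truthy, so it has to be materialized with `list` before testing whether anything matched.

```python
def matching_values(attributes, keyword):
    """Hypothetical generator: yields every attribute whose value equals keyword."""
    for key, value in attributes.items():
        if value == keyword:
            yield {key: value}


attributes = {"id": "page-data", "type": "application/json"}

# A generator object is truthy even when it would yield nothing.
empty_generator = matching_values(attributes, "missing")
print(bool(empty_generator))  # True

# Materializing it with list() gives a value whose truthiness is meaningful.
print(bool(list(matching_values(attributes, "missing"))))  # False
print(list(matching_values(attributes, "page-data")))  # [{'id': 'page-data'}]
```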
|
| 532 |
|
| 533 |
-
|
| 534 |
|
| 535 |
-
|
| 536 |
-
|
| 537 |
-
>>>
|
| 538 |
-
|
| 539 |
-
|
|
|
|
| 1 |
## Introduction
|
| 2 |
+
After exploring the various ways to select elements with Scrapling and related features, let's take a step back and examine the [Selector](#selector) class generally and other objects to better understand the parsing engine.
|
| 3 |
|
| 4 |
+
The [Selector](#selector) class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities. You can always import it with any of the following imports
|
| 5 |
```python
|
| 6 |
+
from scrapling import Selector
|
| 7 |
+
from scrapling.parser import Selector
|
| 8 |
```
|
| 9 |
+
Then use it directly as you already learned in the [overview](../overview.md) page
|
| 10 |
```python
|
| 11 |
+
page = Selector(
|
| 12 |
+
'<html>...</html>',
|
| 13 |
url='https://example.com'
|
| 14 |
)
|
| 15 |
|
| 16 |
# Then select elements as you like
|
| 17 |
+
elements = page.css('.product')
|
| 18 |
```
|
| 19 |
+
In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, a [Selector](#selector) object. Any operation you do, like selection, navigation, etc., will return either a [Selector](#selector) object or a [Selectors](#selectors) object, given that the result is element/elements from the page, not text or similar.
|
| 20 |
|
| 21 |
+
In other words, the main page is a [Selector](#selector) object, and the elements within are [Selector](#selector) objects, and so on. Any text, such as the text content inside elements or the text inside element attributes, is a [TextHandler](#texthandler) object, and the attributes of each element are stored as [AttributesHandler](#attributeshandler). We will return to both objects later, so let's focus on the [Selector](#selector) object.
|
| 22 |
|
| 23 |
+
## Selector
|
| 24 |
### Arguments explained
|
| 25 |
+
The most important one is `content`, which is used to pass the HTML code you want to parse. It accepts the HTML content as `str` or `bytes`.
|
| 26 |
|
| 27 |
+
Otherwise, you have the arguments `url`, `adaptive`, `storage`, and `storage_args`. All of these are settings for the `adaptive` feature and make no difference if you are not using it, so just ignore them for now; we will explain them on the [adaptive](adaptive.md) feature page.
|
| 28 |
|
| 29 |
+
Then you have the arguments that adjust or manipulate the HTML content while the library is parsing it:
|
| 30 |
|
| 31 |
- **encoding**: This is the encoding that will be used while parsing the HTML. The default is `UTF-8`.
|
| 32 |
+
- **keep_comments**: This tells the library whether to keep HTML comments while parsing the page. It's disabled by default because it can cause issues with your scraping in various ways.
|
| 33 |
+
- **keep_cdata**: Same logic as the HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML.
|
| 34 |
|
| 35 |
I have intentionally ignored the arguments `huge_tree` and `root` to avoid making this page more complicated than needed.
|
| 36 |
+
You may notice that I'm doing that a lot because it involves advanced features that you don't need to know to use the library. The development section will cover these missing parts if you are very invested.
|
| 37 |
|
| 38 |
+
After that, for the main page and the elements within, most properties are lazily loaded. This means a property, such as the text content of a page/element, isn't initialized until you use it, which is one of the reasons for Scrapling's speed :)
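
To make the lazy-loading idea concrete, here is a minimal sketch using only the standard library. `LazyElement` and its `text` property are hypothetical names for illustration and say nothing about how Scrapling implements this internally; the sketch only shows that the work runs on first access and is cached afterwards.

```python
from functools import cached_property


class LazyElement:
    """Hypothetical stand-in for a parsed element (not Scrapling's class)."""

    def __init__(self, raw: str):
        self._raw = raw
        self.computations = 0  # counts how often the text is actually computed

    @cached_property
    def text(self) -> str:
        # Runs only on first access; the result is then cached on the instance.
        self.computations += 1
        return self._raw.strip()


element = LazyElement("  Product 1  ")
print(element.computations)  # 0, nothing computed yet
print(element.text)          # Product 1
print(element.text)          # Product 1, served from the cache
print(element.computations)  # 1
```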
|
| 39 |
|
| 40 |
### Properties
|
| 41 |
You have already seen much of this on the [overview](../overview.md) page, but don't worry if you didn't. We will review it more thoroughly using more advanced methods/usages. For clarity, the properties for traversal are separated below in the [traversal](#traversal) section.
|
|
|
|
| 81 |
```
|
| 82 |
Load the page directly as shown before:
|
| 83 |
```python
|
| 84 |
+
from scrapling import Selector
|
| 85 |
+
page = Selector(html_doc)
|
| 86 |
```
|
| 87 |
Get all text content on the page recursively
|
| 88 |
```python
|
| 89 |
>>> page.get_all_text()
|
| 90 |
'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
|
| 91 |
```
|
| 92 |
+
Get the first article, as explained before; we will use it as an example
|
| 93 |
```python
|
| 94 |
article = page.find('article')
|
| 95 |
```
|
|
|
|
| 98 |
>>> article.get_all_text()
|
| 99 |
'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
|
| 100 |
```
|
| 101 |
+
But if you try to get the direct text content, it will be empty because the element doesn't have direct text in the HTML code above
|
| 102 |
```python
|
| 103 |
>>> article.text
|
| 104 |
''
|
|
|
|
| 107 |
|
| 108 |
1. **separator**: All strings collected will be concatenated using this separator. The default is '\n'
|
| 109 |
2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
|
| 110 |
+
3. **ignore_tags**: A tuple of all tag names you want to ignore in the final results; any elements nested within them are ignored as well. The default is `('script', 'style',)`.
|
| 111 |
4. **valid_values**: If enabled, the method will only collect elements with real values, so all elements with empty text content or only whitespaces will be ignored. It's enabled by default
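
The four arguments above can be sketched with the standard library's `html.parser`. This is an illustrative approximation under stated assumptions, not Scrapling's implementation: `TextCollector` and this standalone `get_all_text` function are hypothetical, and the order in which `valid_values` and `strip` are applied is a guess.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextCollector(HTMLParser):
    """Collects text nodes, skipping everything nested inside ignored tags."""

    def __init__(self, ignore_tags=("script", "style")):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.ignored_depth = 0  # greater than 0 while inside an ignored tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in self.ignore_tags or self.ignored_depth:
            self.ignored_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.ignored_depth:
            self.ignored_depth -= 1

    def handle_data(self, data):
        if not self.ignored_depth:
            self.chunks.append(data)


def get_all_text(html, separator="\n", strip=False,
                 ignore_tags=("script", "style"), valid_values=True):
    collector = TextCollector(ignore_tags)
    collector.feed(html)
    collector.close()
    chunks = collector.chunks
    if valid_values:  # drop empty and whitespace-only strings
        chunks = [chunk for chunk in chunks if chunk.strip()]
    if strip:  # strip each string before concatenation
        chunks = [chunk.strip() for chunk in chunks]
    return separator.join(chunks)


html_doc = ('<article><h3>Product 1</h3> <p> In stock: 5 </p>'
            '<script>{"a": 1}</script></article>')
print(repr(get_all_text(html_doc)))
print(repr(get_all_text(html_doc, strip=True)))
print(repr(get_all_text(html_doc, separator=" | ", strip=True)))
```

Passing `ignore_tags=()` to this sketch keeps the `script` content in the result, which mirrors why the default tuple matters.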
|
| 112 |
|
| 113 |
+
By the way, the text returned here is not a standard string but a [TextHandler](#texthandler), which we will cover in detail later. If the text content can be serialized to JSON, call `.json()` on it
|
| 114 |
```python
|
| 115 |
>>> script = page.find('script')
|
| 116 |
>>> script.json()
|
|
|
|
| 121 |
>>> article.tag
|
| 122 |
'article'
|
| 123 |
```
|
| 124 |
+
If you use it on the page directly, you will find that you are operating on the root `html` element
|
| 125 |
```python
|
| 126 |
>>> page.tag
|
| 127 |
'html'
|
|
|
|
| 133 |
>>> print(article.attrib)
|
| 134 |
{'class': 'product', 'data-id': '1'}
|
| 135 |
```
|
| 136 |
+
Access a specific attribute with any of the following methods
|
| 137 |
+
```python
|
| 138 |
+
>>> article.attrib['class']
|
| 139 |
+
>>> article.attrib.get('class')
|
| 140 |
+
>>> article['class'] # new in v0.3
|
| 141 |
+
```
|
| 142 |
+
Check if the attributes contain a specific attribute with any of the methods below
|
| 143 |
+
```python
|
| 144 |
+
>>> 'class' in article.attrib
|
| 145 |
+
>>> 'class' in article # new in v0.3
|
| 146 |
+
```
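
Shortcuts like `article['class']` and `'class' in article` are the kind of behavior Python's `__getitem__` and `__contains__` hooks provide. The sketch below is a guess at the general mechanism with a hypothetical `Element` class; it is not taken from Scrapling's source.

```python
class Element:
    """Hypothetical element wrapper; `attrib` mirrors the property shown above."""

    def __init__(self, tag: str, attrib: dict):
        self.tag = tag
        self.attrib = attrib

    def __getitem__(self, name: str) -> str:
        # element['class'] becomes shorthand for element.attrib['class']
        return self.attrib[name]

    def __contains__(self, name: str) -> bool:
        # 'class' in element becomes shorthand for 'class' in element.attrib
        return name in self.attrib


article = Element("article", {"class": "product", "data-id": "1"})
print(article["class"])      # product
print("data-id" in article)  # True
print("href" in article)     # False
```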
|
| 147 |
Get the HTML content of the element
|
| 148 |
```python
|
| 149 |
>>> article.html_content
|
|
|
|
| 154 |
>>> article.body
|
| 155 |
'<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
|
| 156 |
```
|
| 157 |
+
Get the prettified version of the element's HTML content
|
| 158 |
```python
|
| 159 |
>>> print(article.prettify())
|
| 160 |
<article class="product" data-id="1"><h3>Product 1</h3>
|
|
|
|
| 186 |
```
|
| 187 |
|
| 188 |
### Traversal
|
| 189 |
+
Using the elements we found above, we will go over the properties/methods for moving around the page in detail.
|
| 190 |
|
| 191 |
If you are unfamiliar with the DOM tree or the tree data structure in general, the following traversal part can be confusing. I recommend you look up these concepts online for a better understanding.
|
| 192 |
|
| 193 |
If you are too lazy to search about it, here's a quick explanation to give you a good idea.<br/>
|
| 194 |
+
In simple words, the `html` element is the root of the website's tree, as every page starts with an `html` element.<br/>
|
| 195 |
This element will be directly above elements like `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent." The element `body` is a "sibling" of the element `head` and vice versa.
|
| 196 |
|
| 197 |
Accessing the parent of an element
|
|
|
|
| 249 |
```
|
| 250 |
Get the next element of the current element
|
| 251 |
```python
|
| 252 |
+
>>> article.next
|
| 253 |
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
|
| 254 |
```
|
| 255 |
The same logic applies to the `previous` property
|
|
|
|
| 269 |
for ancestor in article.iterancestors():
|
| 270 |
# do something with it...
|
| 271 |
```
|
| 272 |
+
You can search for a specific ancestor of an element that satisfies a search function; all you need to do is pass a function that takes a [Selector](#selector) object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
|
| 273 |
```python
|
| 274 |
>>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
|
| 275 |
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
|
|
|
|
| 277 |
>>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list')) # Same result, different approach
|
| 278 |
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
|
| 279 |
```
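
For readers new to tree traversal, the ancestor search can be pictured with a tiny hand-built tree. `Node` below is a hypothetical class that borrows the method names used above; it assumes ancestors are visited from the nearest one upward and that the first match is returned, which may differ from the library's internals.

```python
class Node:
    """Hypothetical tree node with a parent pointer (not Scrapling's Selector)."""

    def __init__(self, tag, classes=(), parent=None):
        self.tag = tag
        self.classes = set(classes)
        self.parent = parent

    def has_class(self, name):
        return name in self.classes

    def iterancestors(self):
        # Walk upward: parent, grandparent, ... until the root is passed.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def find_ancestor(self, func):
        # Return the first (nearest) ancestor for which func returns a truthy value.
        for ancestor in self.iterancestors():
            if func(ancestor):
                return ancestor
        return None


html = Node("html")
body = Node("body", parent=html)
product_list = Node("div", classes=["product-list"], parent=body)
article = Node("article", classes=["product"], parent=product_list)

print([ancestor.tag for ancestor in article.iterancestors()])  # ['div', 'body', 'html']
print(article.find_ancestor(lambda a: a.has_class("product-list")).tag)  # div
print(article.find_ancestor(lambda a: a.tag == "table"))  # None
```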
|
| 280 |
+
## Selectors
|
| 281 |
+
The class `Selectors` is the "List" version of the [Selector](#selector) class. It inherits from the Python standard `List` type, so it shares all `List` properties and methods while adding more methods to make the operations you want to execute on the [Selector](#selector) instances within more straightforward.
|
| 282 |
|
| 283 |
+
In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance. The only exceptions are when you use the CSS/XPath methods as follows:
|
| 284 |
|
| 285 |
- If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
|
| 286 |
```python
|
|
|
|
| 295 |
```
|
| 296 |
- If you used a combined selector that returns mixed types, the result will be a Python standard `List`. <br/>Examples:
|
| 297 |
```python
|
| 298 |
+
>>> page.css('.price_color') # -> Selectors
|
| 299 |
>>> page.css('.product_pod a::attr(href)') # -> TextHandlers
|
| 300 |
>>> page.css('.price_color, .product_pod a::attr(href)') # -> List
|
| 301 |
```
|
| 302 |
|
| 303 |
+
With that out of the way, let's see what the [Selectors](#selectors) class brings to the table.
|
| 304 |
### Properties
|
| 305 |
Apart from the normal operations on Python lists like iteration, slicing, etc...
|
| 306 |
|
| 307 |
You can do the following:
|
| 308 |
|
| 309 |
+
Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has. The arguments and the return types are the same as [Selector](#selector)'s `css` and `xpath` methods, which makes chaining methods very straightforward.
|
| 310 |
```python
|
| 311 |
>>> page.css('.product_pod a')
|
| 312 |
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
|
|
|
|
| 326 |
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
|
| 327 |
...]
|
| 328 |
```
|
| 329 |
+
Run the `re` and `re_first` methods directly. They take the same arguments as in the [Selector](#selector) class. These methods are explained in the [TextHandler](#texthandler) section below.
|
| 330 |
|
| 331 |
+
However, in this class, `re_first` behaves differently: it runs `re` on each [Selector](#selector) within and returns the first result found. The `re` method returns a [TextHandlers](#texthandlers) object as usual, combining all the [TextHandler](#texthandler) instances into a single [TextHandlers](#texthandlers) instance.
|
| 332 |
```python
|
| 333 |
>>> page.css('.price_color').re(r'[\d\.]+')
|
| 334 |
['51.77',
|
|
|
|
| 345 |
'sharp-objects_997',
|
| 346 |
...]
|
| 347 |
```
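
The difference between the two methods on a collection can be sketched with the standard `re` module. Plain strings stand in for the elements' text here, and `re_all` and `re_first` are hypothetical helper functions, so treat this as an illustration of the described behavior rather than the library's code.

```python
import re


def re_all(texts, pattern):
    """Run the pattern on every text and flatten all matches into one list."""
    return [match for text in texts for match in re.findall(pattern, text)]


def re_first(texts, pattern):
    """Return the first match from the first text that has one, else None."""
    for text in texts:
        matches = re.findall(pattern, text)
        if matches:
            return matches[0]
    return None


prices = ["$51.77", "n/a", "$53.74"]
print(re_all(prices, r"[\d\.]+"))     # ['51.77', '53.74']
print(re_first(prices, r"[\d\.]+"))   # 51.77
print(re_first(["n/a"], r"[\d\.]+"))  # None
```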
|
| 348 |
+
With the `search` method, you can search quickly in the available [Selector](#selector) instances. The function you pass must accept a [Selector](#selector) instance as the first argument and return True/False. The method will return the first [Selector](#selector) instance that satisfies the function; otherwise, it will return `None`.
|
| 349 |
```python
|
| 350 |
# Find the first product with price '54.23'
|
| 351 |
>>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
|
| 352 |
>>> page.css('.product_pod').search(search_function)
|
| 353 |
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
|
| 354 |
```
|
| 355 |
+
You can use the `filter` method, too, which takes a function like the `search` method but returns a `Selectors` instance of all the [Selector](#selector) instances that satisfy the function
|
| 356 |
```python
|
| 357 |
# Find all products with prices over $50
|
| 358 |
>>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
|
|
|
|
| 362 |
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
|
| 363 |
...]
|
| 364 |
```
|
| 365 |
+
If you are lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance, you can do this:
|
| 366 |
+
```python
|
| 367 |
+
page.css('.product_pod').length
|
| 368 |
+
```
|
| 369 |
+
instead of this
|
| 370 |
+
```python
|
| 371 |
+
len(page.css('.product_pod'))
|
| 372 |
+
```
|
| 373 |
+
Yup, like JavaScript :)
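
Since `Selectors` is described as a list subclass with a few extras, the general shape of `search`, `filter`, and `length` can be sketched on a plain list of numbers. `ElementList` is a hypothetical class written for this illustration; the real class works on [Selector](#selector) instances and may differ in detail.

```python
class ElementList(list):
    """Hypothetical list subclass; Scrapling's `Selectors` is the real one."""

    def search(self, func):
        # First item that satisfies func, or None when nothing matches.
        return next((item for item in self if func(item)), None)

    def filter(self, func):
        # Return the same type so calls can be chained.
        return self.__class__(item for item in self if func(item))

    @property
    def length(self):
        return len(self)


prices = ElementList([51.77, 53.74, 50.10, 47.82, 54.23])
print(prices.search(lambda p: p == 54.23))   # 54.23
print(prices.search(lambda p: p > 100))      # None
print(prices.filter(lambda p: p > 50))       # [51.77, 53.74, 50.1, 54.23]
print(prices.filter(lambda p: p > 50).length)  # 4
```

Returning `self.__class__` from `filter` is what keeps the extra methods available on the filtered result.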
|
| 374 |
|
| 375 |
## TextHandler
|
| 376 |
This class is mandatory to understand, as all methods/properties that should return a string for you will return `TextHandler`, and the ones that should return a list of strings will return [TextHandlers](#texthandlers) instead.
|
| 377 |
|
| 378 |
+
TextHandler is a subclass of the standard Python string, so you can do anything with it that you can do with a Python string. So, what is the difference that requires a different naming?
|
| 379 |
|
| 380 |
+
Of course, TextHandler provides extra methods and properties that standard Python strings don't have. We will review them now, but remember that all methods and properties in all classes that return string(s) return TextHandler, which opens the door for creativity and makes the code shorter and cleaner, as you will see. Also, you can import it directly and use it on any string, which we will explain [later](../development/scrapling_custom_types.md).
|
| 381 |
### Usage
|
| 382 |
+
First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a `TextHandler` again, so you can chain them as you want. If you find a method or property that returns a standard string instead of `TextHandler`, please open an issue, and we will override it as well.
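
Why the overriding matters: in plain Python, the built-in methods of a `str` subclass return a standard `str`, not the subclass. The sketch below shows that behavior and one way to override a few methods so the type is preserved. `PlainText` and `Text` are hypothetical classes for illustration only and do not reproduce `TextHandler`'s actual return types.

```python
class PlainText(str):
    """A str subclass with no overrides."""


class Text(str):
    """Hypothetical str subclass that keeps its own type across operations."""

    def __getitem__(self, key):
        return self.__class__(super().__getitem__(key))

    def strip(self, chars=None):
        return self.__class__(super().strip(chars))

    def replace(self, old, new, count=-1):
        return self.__class__(super().replace(old, new, count))


# Without overrides, the subclass is lost after the first operation.
print(type(PlainText("  In stock: 5  ").strip()).__name__)  # str

# With overrides, every step returns the subclass again, so calls chain.
result = Text("  In stock: 5  ").strip().replace("In stock: ", "")[0]
print(result, type(result).__name__)  # 5 Text
```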
|
| 383 |
|
| 384 |
+
Let's start with the `re` and `re_first` methods. These are the same methods that exist in the rest of the classes ([Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers)), so they will take the same arguments as well.
|
| 385 |
|
| 386 |
+
- The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but as you probably figured out from the naming, it returns the first result only as a `TextHandler` instance.
|
| 387 |
|
| 388 |
Also, it takes other helpful arguments, which are:
|
| 389 |
|
| 390 |
- **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
|
| 391 |
+
- **clean_match**: It's disabled by default. Enabling it makes the method ignore all whitespace and consecutive spaces while matching.
|
| 392 |
+
- **case_sensitive**: It's enabled by default. As the name implies, disabling it makes the regex ignore letter case when it is compiled.
|
| 393 |
+
|
| 394 |
You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
|
| 395 |
```python
|
| 396 |
>>> page.css('.price_color').re(r'[\d\.]+')
|
|
|
|
| 425 |
>>> test_string.re('hi there', clean_match=True, case_sensitive=False)
|
| 426 |
['hi There']
|
| 427 |
```
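If it helps to see what the three options mean in plain Python, the sketch below approximates them with the standard `re` and `html` modules. The `regex_all` helper is hypothetical, it only accepts string patterns, and the way it handles `clean_match` (collapsing whitespace runs before matching) is an assumption, so treat it as an illustration rather than Scrapling's actual code.

```python
import html
import re


def regex_all(text, pattern, replace_entities=True, clean_match=False, case_sensitive=True):
    """Hypothetical helper that approximates the three options described above."""
    if clean_match:
        # Assumption: collapse every whitespace run into one space before matching
        text = " ".join(text.split())
    flags = 0 if case_sensitive else re.IGNORECASE
    results = []
    for match in re.finditer(pattern, text, flags):
        found = match.group(0)
        # Turn entity references like &amp; back into their characters
        results.append(html.unescape(found) if replace_entities else found)
    return results


sample = "hi \n  There &amp; welcome"
print(regex_all(sample, r"hi there", clean_match=True, case_sensitive=False))  # ['hi There']
print(regex_all(sample, r"&amp; \w+"))  # ['& welcome']
print(regex_all(sample, r"&amp; \w+", replace_entities=False))  # ['&amp; welcome']
```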
|
| 428 |
+
Another benefit of returning `TextHandler` everywhere is that a property like `html_content` returns a `TextHandler` too, so you can run a regex on the HTML content if you want:
|
| 429 |
```python
|
| 430 |
>>> page.html_content.re('div class=".*">(.*)</div')
|
| 431 |
['In stock: 5', 'In stock: 3', 'Out of stock']
|
| 432 |
```
|
| 433 |
|
| 434 |
+
- You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it raises an error:
|
| 435 |
```python
|
| 436 |
>>> page.css_first('#page-data::text')
|
| 437 |
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
|
| 438 |
>>> page.css_first('#page-data::text').json()
|
| 439 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 440 |
```
|
| 441 |
+
Note that if you don't specify a text node while selecting an element (like the text content or an attribute's text content), the text content will be selected automatically, like this:
|
| 442 |
```python
|
| 443 |
>>> page.css_first('#page-data').json()
|
| 444 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 445 |
```
|
| 446 |
+
The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
|
| 447 |
```html
|
| 448 |
<html>
|
| 449 |
<body>
|
|
|
|
| 458 |
</body>
|
| 459 |
</html>
|
| 460 |
```
|
| 461 |
+
The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
|
| 462 |
So, if you did something like this:
|
| 463 |
```python
|
| 464 |
>>> page.css_first('div::text').json()
|
| 465 |
```
|
| 466 |
+
You will get an error because the `div` tag doesn't have direct text content that can be serialized to JSON; it actually doesn't have direct text content at all.<br/><br/>
|
| 467 |
In this case, the `get_all_text` method comes to the rescue, so you can do something like this:
|
| 468 |
```python
|
| 469 |
>>> page.css_first('div').get_all_text(ignore_tags=[]).json()
|
| 470 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 471 |
```
|
| 472 |
I used the `ignore_tags` argument here because its default value is `('script', 'style',)`, as you are aware, so the text inside `script` tags would be skipped otherwise.<br/><br/>
|
| 473 |
+
Another related behavior to be aware of occurs when using any of the fetchers, which we will explain later. If you have a JSON response like this example:
|
| 474 |
```python
|
| 475 |
+
>>> page = Selector("""{"some_key": "some_value"}""")
|
| 476 |
```
|
| 477 |
+
Because the [Selector](#selector) class is optimized to deal with HTML pages, it treats this input as a broken HTML response and fixes it, so if you use the `html_content` property, you get this:
|
| 478 |
```python
|
| 479 |
>>> page.html_content
|
| 480 |
'<html><body><p>{"some_key": "some_value"}</p></body></html>'
|
| 481 |
```
|
| 482 |
+
Here, you can use the `json` method directly, and it will work
|
| 483 |
```python
|
| 484 |
>>> page.json()
|
| 485 |
{'some_key': 'some_value'}
|
| 486 |
```
|
| 487 |
+
You might wonder how this works when the `html` tag doesn't have direct text.<br/>
|
| 488 |
+
Well, for cases like JSON responses, I made the [Selector](#selector) class maintain a raw copy of the content passed to it. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is not available, as is the case with elements, it checks the current element's text content; otherwise, it uses the `get_all_text` method directly.<br/><br/>This might sound a bit hacky, but remember that Scrapling is currently optimized to work with HTML pages only, so this is the best way so far to handle JSON responses without sacrificing speed. This will be changed in the upcoming versions.
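The fallback order described above can be sketched with a tiny stand-in class. Everything here (the `Node` class, its `raw`, `text`, and `children` attributes, and `all_text`) is hypothetical and only mirrors the described order of checks; it is not Scrapling's code.

```python
import json


class Node:
    """Hypothetical stand-in for an element; NOT Scrapling's Selector."""

    def __init__(self, text="", children=(), raw=None):
        self.text = text                # direct text content of this node
        self.children = list(children)  # nested nodes
        self.raw = raw                  # raw copy of the input, kept for the root only

    def all_text(self):
        # Direct text plus the text of every descendant
        return self.text + "".join(child.all_text() for child in self.children)

    def json(self):
        if self.raw is not None:   # 1. a raw copy exists (e.g. a JSON response)
            return json.loads(self.raw)
        if self.text.strip():      # 2. the node has direct text content
            return json.loads(self.text)
        return json.loads(self.all_text())  # 3. fall back to all nested text


page = Node(raw='{"some_key": "some_value"}')
script = Node(text='{"totalProducts": 3}')
div = Node(children=[script])
print(page.json(), script.json(), div.json())
```

Checking the raw copy first means a JSON response never has to go through the repaired HTML tree at all.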
|
| 489 |
|
| 490 |
+
- Another handy method is `.clean()`, which removes whitespace characters such as newlines, collapses consecutive spaces, and returns a new `TextHandler` instance:
|
| 491 |
```python
|
| 492 |
>>> TextHandler('\n wonderful idea, \reh?').clean()
|
| 493 |
'wonderful idea, eh?'
|
| 494 |
```
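For comparison, a rough standard-library equivalent of this cleanup is shown below. The `clean` function is a hypothetical helper that happens to give the same result for this input; the exact rules `.clean()` applies may differ for other inputs.

```python
def clean(text: str) -> str:
    """Hypothetical approximation of the cleanup using only built-in str methods."""
    # split() with no arguments splits on any whitespace run and drops empty pieces
    return " ".join(text.split())


print(repr(clean('\n wonderful idea, \reh?')))  # 'wonderful idea, eh?'
```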
|
| 495 |
|
| 496 |
+
- Another method that might be helpful in some cases is the `.sort()` method to sort the string for you, as you do with lists
|
| 497 |
```python
|
| 498 |
>>> TextHandler('acb').sort()
|
| 499 |
'abc'
|
|
|
|
| 507 |
Other methods and properties will be added over time, but remember that this class is returned in place of strings nearly everywhere in the library.
|
| 508 |
|
| 509 |
## TextHandlers
|
| 510 |
+
You probably guessed it: This class is similar to [Selectors](#selectors) and [Selector](#selector), but here it inherits the same logic and methods as standard lists, with only `re` and `re_first` as new methods.
|
| 511 |
|
| 512 |
+
The only difference is that the `re_first` method logic here does `re` on each [TextHandler](#texthandler) within and returns the first result it has or `None`. Nothing is new to explain here, but new methods will be added over time.
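As a rough illustration of that behavior, the sketch below builds a small list subclass with `re` and `re_first` methods. The `TextList` class is hypothetical, and the assumption that `re` flattens the matches from all items into one list is mine; the real methods also accept the extra arguments described in the [TextHandler](#texthandler) section.

```python
import re


class TextList(list):
    """Hypothetical list of strings for illustration; NOT Scrapling's TextHandlers."""

    def re(self, pattern):
        # Assumption: matches from every item are flattened into one list
        return TextList(m for item in self for m in re.findall(pattern, item))

    def re_first(self, pattern, default=None):
        # Try each item in order and stop at the first one that has a match
        for item in self:
            found = re.findall(pattern, item)
            if found:
                return found[0]
        return default


prices = TextList(["£51.77", "no price", "£53.74"])
print(prices.re(r"[\d.]+"))           # ['51.77', '53.74']
print(prices.re_first(r"[\d.]+"))     # 51.77
print(prices.re_first(r"[a-z]{20}"))  # None
```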
|
| 513 |
|
| 514 |
## AttributesHandler
|
| 515 |
+
This is a read-only version of Python's standard dictionary (`dict`) that's only used to store the attributes of each element, in other words, of each [Selector](#selector) instance.
|
| 516 |
```python
|
| 517 |
>>> print(page.find('script').attrib)
|
| 518 |
{'id': 'page-data', 'type': 'application/json'}
|
| 519 |
>>> type(page.find('script').attrib).__name__
|
| 520 |
'AttributesHandler'
|
| 521 |
```
|
| 522 |
+
Because it's read-only, it will use fewer resources than the standard dictionary. Still, it has the same dictionary methods and properties, except those that allow you to modify/override the data.
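If you haven't worked with read-only mappings before, the standard library's `types.MappingProxyType` behaves in a comparable way: reads work like a normal `dict`, while writes fail. This is only an analogy for the behavior described above, not a statement about how `AttributesHandler` is built.

```python
from types import MappingProxyType

# A read-only view over a normal dictionary (standard-library analogy only)
attrs = MappingProxyType({"id": "page-data", "type": "application/json"})

print(attrs["id"], attrs.get("class"), list(attrs.keys()))

try:
    attrs["id"] = "other"  # any attempt to modify the mapping fails
except TypeError as exc:
    print("read-only:", exc)
```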
|
| 523 |
|
| 524 |
It currently adds two extra simple methods:
|
| 525 |
|
|
|
|
| 550 |
|
| 551 |
Note that I used the `list` function here because `search_values` returns a generator, and a generator object is always truthy, so without `list` the check would be `True` for all elements.
|
| 552 |
|
| 553 |
+
- The `json_string` property
|
| 554 |
|
| 555 |
+
This property converts the current attributes to a JSON string if the attributes are JSON serializable; otherwise, it raises an error:
|
| 556 |
+
```python
|
| 557 |
+
>>> page.find('script').attrib.json_string
|
| 558 |
+
b'{"id":"page-data","type":"application/json"}'
|
| 559 |
+
```
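Note that the result above is a `bytes` object with no spaces after the separators. A standard-library way to produce the same compact output for a plain dictionary is sketched below; it is only a comparison and makes no claim about which JSON library Scrapling uses internally.

```python
import json

attributes = {"id": "page-data", "type": "application/json"}

# Compact separators drop the spaces; encode() turns the string into bytes
json_bytes = json.dumps(attributes, separators=(",", ":")).encode("utf-8")
print(json_bytes)  # b'{"id":"page-data","type":"application/json"}'

# Values that are not JSON serializable raise an error instead
try:
    json.dumps({"callback": print})
except TypeError as exc:
    print("not serializable:", exc)
```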
|