Karim shoair committed on
Commit ·
960e783
1
Parent(s): e6492de
docs: updating `main classes` page and some corrections
- docs/parsing/main_classes.md +67 -28
docs/parsing/main_classes.md
CHANGED
|
@@ -165,10 +165,10 @@ print(article.prettify())
|
|
| 165 |
<div class="hidden stock">In stock: 5</div>
|
| 166 |
</article>
|
| 167 |
```
|
| 168 |
-
Use the `.body` property to get the raw content of the page
|
| 169 |
```python
|
| 170 |
>>> page.body
|
| 171 |
-
'<html>\n <head>\n <title>Some page</title>\n </head>\n
|
| 172 |
```
|
| 173 |
To get all the ancestors in the DOM tree of this element
|
| 174 |
```python
|
|
@@ -233,7 +233,7 @@ This element returns the same result as the `children` property because its chil
|
|
| 233 |
|
| 234 |
Another example, using the element with the `product-list` class, will clarify the difference between the `children` property and the `below_elements` property
|
| 235 |
```python
|
| 236 |
-
>>> products_list = page.
|
| 237 |
>>> products_list.children
|
| 238 |
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
|
| 239 |
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
|
|
@@ -262,7 +262,7 @@ Get the next element of the current element
|
|
| 262 |
The same logic applies to the `previous` property
|
| 263 |
```python
|
| 264 |
>>> article.previous # It's the first child, so it doesn't have a previous element
|
| 265 |
-
>>> second_article = page.
|
| 266 |
>>> second_article.previous
|
| 267 |
<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
|
| 268 |
```
|
|
@@ -287,27 +287,57 @@ You can search for a specific ancestor of an element that satisfies a search fun
|
|
| 287 |
## Selectors
|
| 288 |
The `Selectors` class is the "list" version of the [Selector](#selector) class. It inherits from Python's standard `list` type, so it shares all `list` properties and methods while adding methods that make operating on the [Selector](#selector) instances it contains more straightforward.
|
| 289 |
|
| 290 |
-
In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
|
| 291 |
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
- If you used a combined selector that returns mixed types, the result will be a Python standard `List`. <br/>Examples:
|
| 304 |
-
```python
|
| 305 |
-
>>> page.css('.price_color') # -> Selectors
|
| 306 |
-
>>> page.css('.product_pod a::attr(href)') # -> TextHandlers
|
| 307 |
-
>>> page.css('.price_color, .product_pod a::attr(href)') # -> List
|
| 308 |
-
```
|
| 309 |
|
| 310 |
-
|
| 311 |
### Properties
|
| 312 |
Apart from the standard operations on Python lists, such as iteration and slicing.
|
| 313 |
|
|
@@ -369,6 +399,15 @@ You can use the `filter` method, too, which takes a function like the `search` m
|
|
| 369 |
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
|
| 370 |
...]
|
| 371 |
```
|
| 372 |
If you are lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance, you can do this:
|
| 373 |
```python
|
| 374 |
page.css('.product_pod').length
|
|
@@ -440,14 +479,14 @@ First, we start with the `re` and `re_first` methods. These are the same methods
|
|
| 440 |
|
| 441 |
- You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it throws an error
|
| 442 |
```python
|
| 443 |
-
>>> page.
|
| 444 |
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
|
| 445 |
-
>>> page.
|
| 446 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 447 |
```
|
| 448 |
If you don't select a text node explicitly (such as the text content or an attribute value), the element's text content is used automatically, like this
|
| 449 |
```python
|
| 450 |
-
>>> page.
|
| 451 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 452 |
```
|
| 453 |
The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
|
|
@@ -468,12 +507,12 @@ First, we start with the `re` and `re_first` methods. These are the same methods
|
|
| 468 |
The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
|
| 469 |
So, if you did something like this
|
| 470 |
```python
|
| 471 |
-
>>> page.
|
| 472 |
```
|
| 473 |
You will get an error because the `div` tag doesn't have any direct text content that can be serialized to JSON; it doesn't have any direct text content at all.<br/><br/>
|
| 474 |
In this case, the `get_all_text` method comes to the rescue, so you can do something like this
|
| 475 |
```python
|
| 476 |
-
>>> page.
|
| 477 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 478 |
```
|
| 479 |
I used the `ignore_tags` argument here because its default value is `('script', 'style',)`, as you are aware.<br/><br/>
|
|
|
|
| 165 |
<div class="hidden stock">In stock: 5</div>
|
| 166 |
</article>
|
| 167 |
```
|
| 168 |
+
Use the `.body` property to get the raw content of the page. Starting from v0.4, when used on a `Response` object from fetchers, `.body` always returns `bytes`.
|
| 169 |
```python
|
| 170 |
>>> page.body
|
| 171 |
+
'<html>\n <head>\n <title>Some page</title>\n </head>\n ...'
|
| 172 |
```
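Since `.body` on a fetcher `Response` is described above as returning `bytes`, turning it into text is an explicit decoding step. The sketch below only illustrates that step: `raw_body` is a hard-coded stand-in for `page.body` (an assumption, not real fetch output), and UTF-8 is assumed as the encoding, which may not hold for every page.

```python
# Hypothetical stand-in for `page.body`; not the result of a real request.
raw_body = b"<html>\n  <head>\n    <title>Some page</title>\n  </head>\n</html>"

# Decode explicitly. UTF-8 is an assumption here; `errors="replace"`
# substitutes a placeholder character instead of raising on bad bytes.
html_text = raw_body.decode("utf-8", errors="replace")

print(type(raw_body).__name__)
print(html_text.splitlines()[0])
```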
|
| 173 |
To get all the ancestors in the DOM tree of this element
|
| 174 |
```python
|
|
|
|
| 233 |
|
| 234 |
Another example, using the element with the `product-list` class, will clarify the difference between the `children` property and the `below_elements` property
|
| 235 |
```python
|
| 236 |
+
>>> products_list = page.css('.product-list')[0]
|
| 237 |
>>> products_list.children
|
| 238 |
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
|
| 239 |
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
|
|
|
|
| 262 |
The same logic applies to the `previous` property
|
| 263 |
```python
|
| 264 |
>>> article.previous # It's the first child, so it doesn't have a previous element
|
| 265 |
+
>>> second_article = page.css('.product[data-id="2"]')[0]
|
| 266 |
>>> second_article.previous
|
| 267 |
<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
|
| 268 |
```
|
|
|
|
| 287 |
## Selectors
|
| 288 |
The `Selectors` class is the "list" version of the [Selector](#selector) class. It inherits from Python's standard `list` type, so it shares all `list` properties and methods while adding methods that make operating on the [Selector](#selector) instances it contains more straightforward.
|
| 289 |
|
| 290 |
+
In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
|
| 291 |
|
| 292 |
+
Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties gracefully return empty/default values.
|
| 293 |
+
|
| 294 |
+
```python
|
| 295 |
+
>>> page.css('a::text') # -> Selectors (of text node Selectors)
|
| 296 |
+
>>> page.xpath('//a/text()') # -> Selectors
|
| 297 |
+
>>> page.css('a::text').get() # -> TextHandler (the first text value)
|
| 298 |
+
>>> page.css('a::text').getall() # -> TextHandlers (all text values)
|
| 299 |
+
>>> page.css('a::attr(href)') # -> Selectors
|
| 300 |
+
>>> page.xpath('//a/@href') # -> Selectors
|
| 301 |
+
>>> page.css('.price_color') # -> Selectors
|
| 302 |
+
```
|
| 303 |
|
| 304 |
+
### Data extraction methods
|
| 305 |
+
Starting with v0.4, [Selector](#selector) and [Selectors](#selectors) both provide `get()`, `getall()`, and their aliases `extract_first` and `extract` (following Scrapy conventions). The old `get_all()` method has been removed.
|
| 306 |
+
|
| 307 |
+
**On a [Selector](#selector) object:**
|
| 308 |
+
|
| 309 |
+
- `get()` returns a `TextHandler` — for text node selectors, it returns the text value; for HTML element selectors, it returns the serialized outer HTML.
|
| 310 |
+
- `getall()` returns a `TextHandlers` list containing the single serialized string.
|
| 311 |
+
- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
|
| 312 |
+
|
| 313 |
+
```python
|
| 314 |
+
>>> page.css('h3')[0].get() # Outer HTML of the element
|
| 315 |
+
'<h3>Product 1</h3>'
|
| 316 |
+
|
| 317 |
+
>>> page.css('h3::text')[0].get() # Text value of the text node
|
| 318 |
+
'Product 1'
|
| 319 |
+
```
|
| 320 |
+
|
| 321 |
+
**On a [Selectors](#selectors) object:**
|
| 322 |
+
|
| 323 |
+
- `get(default=None)` returns the serialized string of the **first** element, or `default` if the list is empty.
|
| 324 |
+
- `getall()` serializes **all** elements and returns a `TextHandlers` list.
|
| 325 |
+
- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
|
| 326 |
+
|
| 327 |
+
```python
|
| 328 |
+
>>> page.css('.price::text').get() # First price text
|
| 329 |
+
'$10.99'
|
| 330 |
+
|
| 331 |
+
>>> page.css('.price::text').getall() # All price texts
|
| 332 |
+
['$10.99', '$20.99', '$15.99']
|
| 333 |
+
|
| 334 |
+
>>> page.css('.price::text').get('') # With default value
|
| 335 |
+
'$10.99'
|
| 336 |
+
```
|
| 337 |
+
|
| 338 |
+
These methods work seamlessly with all selection types (CSS, XPath, `find`, etc.) and are the recommended way to extract text and attribute values in a Scrapy-compatible style.
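The `get(default=None)` / `getall()` semantics described above can be sketched with plain Python. This is only an illustration of the behaviour on an empty versus non-empty list: `TextList` is a hypothetical list subclass invented for this example, not Scrapling's `Selectors` or `TextHandlers` implementation.

```python
from typing import List, Optional


class TextList(list):
    """Hypothetical stand-in for a list of extracted values (not a Scrapling class)."""

    def get(self, default: Optional[str] = None) -> Optional[str]:
        # Return the first item, or `default` when the list is empty.
        return self[0] if self else default

    def getall(self) -> List[str]:
        # Return every item as a plain list copy.
        return list(self)

    # Scrapy-style aliases, mirroring the naming described in the text.
    extract_first = get
    extract = getall


prices = TextList(["$10.99", "$20.99", "$15.99"])
empty = TextList()

first_price = prices.get()    # first item of a non-empty list
all_prices = prices.getall()  # every item
fallback = empty.get("")      # the default, because the list is empty
```

Passing a default such as `''` is handy when the result feeds straight into string operations, because the caller never has to check for `None`.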
|
| 339 |
+
|
| 340 |
+
With that out of the way, let's see what the [Selectors](#selectors) class brings to the table.
|
| 341 |
### Properties
|
| 342 |
Apart from the standard operations on Python lists, such as iteration and slicing.
|
| 343 |
|
|
|
|
| 399 |
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
|
| 400 |
...]
|
| 401 |
```
|
| 402 |
+
You can safely access the first or last element without worrying about index errors:
|
| 403 |
+
```python
|
| 404 |
+
>>> page.css('.product').first # First Selector or None
|
| 405 |
+
<data='<article class="product" data-id="1"><h3...'>
|
| 406 |
+
>>> page.css('.product').last # Last Selector or None
|
| 407 |
+
<data='<article class="product" data-id="3"><h3...'>
|
| 408 |
+
>>> page.css('.nonexistent').first # Returns None instead of raising IndexError
|
| 409 |
+
```
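To see why `first` and `last` are safer than indexing, here is a minimal sketch contrasting the two on an empty list. `SafeList` is a hypothetical list subclass written for this example; it mimics the "element or `None`" behaviour described above and is not Scrapling's code.

```python
class SafeList(list):
    """Hypothetical list subclass; a sketch, not Scrapling's implementation."""

    @property
    def first(self):
        # First item, or None when the list is empty.
        return self[0] if self else None

    @property
    def last(self):
        # Last item, or None when the list is empty.
        return self[-1] if self else None


products = SafeList(["product-1", "product-2", "product-3"])
missing = SafeList()

# Plain indexing raises on an empty list...
try:
    missing[0]
    index_raised = False
except IndexError:
    index_raised = True

# ...while the properties simply give back None.
print(products.first, products.last)
print(missing.first, index_raised)
```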
|
| 410 |
+
|
| 411 |
If you are lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance, you can do this:
|
| 412 |
```python
|
| 413 |
page.css('.product_pod').length
|
|
|
|
| 479 |
|
| 480 |
- You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it throws an error
|
| 481 |
```python
|
| 482 |
+
>>> page.css('#page-data::text').get()
|
| 483 |
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
|
| 484 |
+
>>> page.css('#page-data::text').get().json()
|
| 485 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 486 |
```
|
| 487 |
If you don't select a text node explicitly (such as the text content or an attribute value), the element's text content is used automatically, like this
|
| 488 |
```python
|
| 489 |
+
>>> page.css('#page-data')[0].json()
|
| 490 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 491 |
```
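For intuition about when JSON conversion succeeds or fails, the sketch below uses Python's standard `json` module on a whitespace-padded string shaped like the text shown above. Whether `.json()` relies on the standard `json` module internally is not stated here, so treat this purely as an illustration; `raw_text` is a hand-written stand-in, not output from a real page.

```python
import json

# Hand-written stand-in for the text content of the `#page-data` element.
raw_text = (
    '\n    {\n      "lastUpdated": "2024-09-22T10:30:00Z",'
    '\n      "totalProducts": 3\n    }\n  '
)

# Leading and trailing whitespace is fine for a JSON parser.
data = json.loads(raw_text)

# Empty text, on the other hand, is not valid JSON and raises an error.
try:
    json.loads("")
    empty_is_valid = True
except json.JSONDecodeError:
    empty_is_valid = False

print(data["totalProducts"])
print(empty_is_valid)
```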
|
| 492 |
The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
|
|
|
|
| 507 |
The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
|
| 508 |
So, if you did something like this
|
| 509 |
```python
|
| 510 |
+
>>> page.css('div::text').get().json()
|
| 511 |
```
|
| 512 |
You will get an error because the `div` tag doesn't have any direct text content that can be serialized to JSON; it doesn't have any direct text content at all.<br/><br/>
|
| 513 |
In this case, the `get_all_text` method comes to the rescue, so you can do something like this
|
| 514 |
```python
|
| 515 |
+
>>> page.css('div')[0].get_all_text(ignore_tags=[]).json()
|
| 516 |
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
| 517 |
```
|
| 518 |
I used the `ignore_tags` argument here because its default value is `('script', 'style',)`, as you are aware.<br/><br/>
|