Karim shoair committed on
Commit
960e783
·
1 Parent(s): e6492de

docs: updating `main classes` page and some corrections

Files changed (1)
  1. docs/parsing/main_classes.md +67 -28
docs/parsing/main_classes.md CHANGED
@@ -165,10 +165,10 @@ print(article.prettify())
165
  <div class="hidden stock">In stock: 5</div>
166
  </article>
167
  ```
168
- Use the `.body` property to get the raw content of the page
169
  ```python
170
  >>> page.body
171
- '<html>\n <head>\n <title>Some page</title>\n </head>\n <body>\n <div class="product-list">\n <article class="product" data-id="1">\n <h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>\n\n <article class="product" data-id="2">\n <h3>Product 2</h3>\n <p class="description">This is product 2</p>\n <span class="price">$20.99</span>\n <div class="hidden stock">In stock: 3</div>\n </article>\n\n <article class="product" data-id="3">\n <h3>Product 3</h3>\n <p class="description">This is product 3</p>\n <span class="price">$15.99</span>\n <div class="hidden stock">Out of stock</div>\n </article>\n </div>\n\n <script id="page-data" type="application/json">\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n </script>\n </body>\n</html>'
172
  ```
173
  To get all the ancestors in the DOM tree of this element
174
  ```python
@@ -233,7 +233,7 @@ This element returns the same result as the `children` property because its chil
233
 
234
  Another example, using the element with the `product-list` class, clarifies the difference between the `children` property and the `below_elements` property:
235
  ```python
236
- >>> products_list = page.css_first('.product-list')
237
  >>> products_list.children
238
  [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
239
  <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
@@ -262,7 +262,7 @@ Get the next element of the current element
262
  The same logic applies to the `previous` property
263
  ```python
264
  >>> article.previous # It's the first child, so it doesn't have a previous element
265
- >>> second_article = page.css_first('.product[data-id="2"]')
266
  >>> second_article.previous
267
  <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
268
  ```
@@ -287,27 +287,57 @@ You can search for a specific ancestor of an element that satisfies a search fun
287
  ## Selectors
288
  The `Selectors` class is the list version of the [Selector](#selector) class. It inherits from Python's built-in `list` type, so it supports all `list` properties and methods, and it adds methods that make operating on the [Selector](#selector) instances it contains more straightforward.
289
 
290
- In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance. The only exceptions are when you use the CSS/XPath methods as follows:
291
 
292
- - If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
293
- ```python
294
- >>> page.css('a::text') # -> TextHandlers
295
- >>> page.xpath('//a/text()') # -> TextHandlers
296
- >>> page.css_first('a::text') # -> TextHandler
297
- >>> page.xpath_first('//a/text()') # -> TextHandler
298
- >>> page.css('a::attr(href)') # -> TextHandlers
299
- >>> page.xpath('//a/@href') # -> TextHandlers
300
- >>> page.css_first('a::attr(href)') # -> TextHandler
301
- >>> page.xpath_first('//a/@href') # -> TextHandler
302
- ```
303
- - If you used a combined selector that returns mixed types, the result will be a Python standard `List`. <br/>Examples:
304
- ```python
305
- >>> page.css('.price_color') # -> Selectors
306
- >>> page.css('.product_pod a::attr(href)') # -> TextHandlers
307
- >>> page.css('.price_color, .product_pod a::attr(href)') # -> List
308
- ```
309
 
310
- Let's see what [Selectors](#selectors) class adds to the table with that out of the way.
311
  ### Properties
312
  Apart from the standard operations on Python lists, such as iteration and slicing.
313
 
@@ -369,6 +399,15 @@ You can use the `filter` method, too, which takes a function like the `search` m
369
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
370
  ...]
371
  ```
372
  If you are lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance, you can do this:
373
  ```python
374
  page.css('.product_pod').length
@@ -440,14 +479,14 @@ First, we start with the `re` and `re_first` methods. These are the same methods
440
 
441
  - You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it throws an error
442
  ```python
443
- >>> page.css_first('#page-data::text')
444
  '\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
445
- >>> page.css_first('#page-data::text').json()
446
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
447
  ```
448
  If you don't specify a text node when selecting an element (such as its text content or an attribute value), the element's text content is used automatically, like this:
449
  ```python
450
- >>> page.css_first('#page-data').json()
451
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
452
  ```
453
  The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
@@ -468,12 +507,12 @@ First, we start with the `re` and `re_first` methods. These are the same methods
468
  The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
469
  So if you do something like this
470
  ```python
471
- >>> page.css_first('div::text').json()
472
  ```
473
  You will get an error because the `div` tag doesn't have any direct text content that can be serialized to JSON; it doesn't have any direct text content at all.<br/><br/>
474
  In this case, the `get_all_text` method comes to the rescue, so you can do something like this
475
  ```python
476
- >>> page.css_first('div').get_all_text(ignore_tags=[]).json()
477
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
478
  ```
479
  I used the `ignore_tags` argument here because its default value is `('script', 'style',)`, as you are aware.<br/><br/>
 
165
  <div class="hidden stock">In stock: 5</div>
166
  </article>
167
  ```
168
+ Use the `.body` property to get the raw content of the page. Starting from v0.4, when used on a `Response` object from fetchers, `.body` always returns `bytes`.
169
  ```python
170
  >>> page.body
171
+ '<html>\n <head>\n <title>Some page</title>\n </head>\n ...'
172
  ```
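Because `.body` is described as returning `bytes` on a `Response`, a decode step is needed before treating it as text. The following is a minimal standard-library sketch, not Scrapling code: `raw_body` is a hypothetical stand-in for `page.body`, and UTF-8 is an assumed encoding.

```python
# Hypothetical stand-in for the bytes that `.body` would return on a Response.
raw_body = b'<html>\n <head>\n <title>Some page</title>\n </head>\n</html>'

# Decode to text; errors="replace" avoids raising on malformed byte sequences.
# UTF-8 is an assumption here; a real page may declare another encoding.
html_text = raw_body.decode("utf-8", errors="replace")

print(type(raw_body).__name__)            # bytes
print(html_text.splitlines()[2].strip())  # <title>Some page</title>
```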
173
  To get all the ancestors in the DOM tree of this element
174
  ```python
 
233
 
234
  Another example, using the element with the `product-list` class, clarifies the difference between the `children` property and the `below_elements` property:
235
  ```python
236
+ >>> products_list = page.css('.product-list')[0]
237
  >>> products_list.children
238
  [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
239
  <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 
262
  The same logic applies to the `previous` property
263
  ```python
264
  >>> article.previous # It's the first child, so it doesn't have a previous element
265
+ >>> second_article = page.css('.product[data-id="2"]')[0]
266
  >>> second_article.previous
267
  <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
268
  ```
 
287
  ## Selectors
288
  The `Selectors` class is the list version of the [Selector](#selector) class. It inherits from Python's built-in `list` type, so it supports all `list` properties and methods, and it adds methods that make operating on the [Selector](#selector) instances it contains more straightforward.
289
 
290
+ In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
291
 
292
+ Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties gracefully return empty/default values.
293
+
294
+ ```python
295
+ >>> page.css('a::text') # -> Selectors (of text node Selectors)
296
+ >>> page.xpath('//a/text()') # -> Selectors
297
+ >>> page.css('a::text').get() # -> TextHandler (the first text value)
298
+ >>> page.css('a::text').getall() # -> TextHandlers (all text values)
299
+ >>> page.css('a::attr(href)') # -> Selectors
300
+ >>> page.xpath('//a/@href') # -> Selectors
301
+ >>> page.css('.price_color') # -> Selectors
302
+ ```
303
 
304
+ ### Data extraction methods
305
+ Starting with v0.4, [Selector](#selector) and [Selectors](#selectors) both provide `get()`, `getall()`, and their aliases `extract_first` and `extract` (following Scrapy conventions). The old `get_all()` method has been removed.
306
+
307
+ **On a [Selector](#selector) object:**
308
+
309
+ - `get()` returns a `TextHandler` — for text node selectors, it returns the text value; for HTML element selectors, it returns the serialized outer HTML.
310
+ - `getall()` returns a `TextHandlers` list containing the single serialized string.
311
+ - `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
312
+
313
+ ```python
314
+ >>> page.css('h3')[0].get() # Outer HTML of the element
315
+ '<h3>Product 1</h3>'
316
+
317
+ >>> page.css('h3::text')[0].get() # Text value of the text node
318
+ 'Product 1'
319
+ ```
320
+
321
+ **On a [Selectors](#selectors) object:**
322
+
323
+ - `get(default=None)` returns the serialized string of the **first** element, or `default` if the list is empty.
324
+ - `getall()` serializes **all** elements and returns a `TextHandlers` list.
325
+ - `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
326
+
327
+ ```python
328
+ >>> page.css('.price::text').get() # First price text
329
+ '$10.99'
330
+
331
+ >>> page.css('.price::text').getall() # All price texts
332
+ ['$10.99', '$20.99', '$15.99']
333
+
334
+ >>> page.css('.price::text').get('') # With default value
335
+ '$10.99'
336
+ ```
337
+
338
+ These methods work seamlessly with all selection types (CSS, XPath, `find`, etc.) and are the recommended way to extract text and attribute values in a Scrapy-compatible style.
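The `get(default=None)` / `getall()` semantics described above can be illustrated with a toy list subclass. This is only a sketch of the behavior, not Scrapling's implementation: `ToySelectors` is a hypothetical name, and it holds plain strings instead of real `Selector` objects.

```python
from typing import Optional


class ToySelectors(list):
    """Toy stand-in holding already-serialized strings (not Scrapling's class)."""

    def get(self, default: Optional[str] = None) -> Optional[str]:
        # First item, or the caller-supplied default when the list is empty.
        return self[0] if self else default

    def getall(self) -> list:
        # Every item, returned as a plain list copy.
        return list(self)

    # Scrapy-style aliases, mirroring the naming described in the text.
    extract_first = get
    extract = getall


prices = ToySelectors(["$10.99", "$20.99", "$15.99"])
empty = ToySelectors()

print(prices.get())      # $10.99
print(prices.getall())   # ['$10.99', '$20.99', '$15.99']
print(empty.get())       # None
print(repr(empty.get("")))  # ''
```

Passing a default such as `''` keeps downstream string operations from failing on `None` when nothing matched.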
339
+
340
+ With that out of the way, let's see what the [Selectors](#selectors) class brings to the table.
341
  ### Properties
342
  Apart from the standard operations on Python lists, such as iteration and slicing.
343
 
 
399
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
400
  ...]
401
  ```
402
+ You can safely access the first or last element without worrying about index errors:
403
+ ```python
404
+ >>> page.css('.product').first # First Selector or None
405
+ <data='<article class="product" data-id="1"><h3...'>
406
+ >>> page.css('.product').last # Last Selector or None
407
+ <data='<article class="product" data-id="3"><h3...'>
408
+ >>> page.css('.nonexistent').first # Returns None instead of raising IndexError
409
+ ```
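The "first or last element, or `None`" behavior shown above can be sketched with a small list subclass. This is an illustration under assumptions, not Scrapling's code: `ToyList` is a hypothetical name and the items are plain strings.

```python
class ToyList(list):
    """Toy list with safe `first`/`last` accessors (hypothetical, for illustration)."""

    @property
    def first(self):
        # An empty list is falsy, so indexing is only attempted when safe.
        return self[0] if self else None

    @property
    def last(self):
        return self[-1] if self else None


products = ToyList(["product-1", "product-2", "product-3"])

print(products.first)    # product-1
print(products.last)     # product-3
print(ToyList().first)   # None
```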
410
+
411
  If you are lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance, you can do this:
412
  ```python
413
  page.css('.product_pod').length
 
479
 
480
  - You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it throws an error
481
  ```python
482
+ >>> page.css('#page-data::text').get()
483
  '\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
484
+ >>> page.css('#page-data::text').get().json()
485
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
486
  ```
487
  If you don't specify a text node when selecting an element (such as its text content or an attribute value), the element's text content is used automatically, like this:
488
  ```python
489
+ >>> page.css('#page-data')[0].json()
490
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
491
  ```
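For readers wondering what a `.json()` helper on a text value amounts to, here is a minimal sketch built on the standard `json` module. `ToyText` is a hypothetical `str` subclass used only for illustration; it is not Scrapling's `TextHandler`, whose internals may differ.

```python
import json


class ToyText(str):
    """Hypothetical str subclass with a json() helper (not Scrapling's TextHandler)."""

    def json(self):
        # json.loads tolerates surrounding whitespace and raises
        # json.JSONDecodeError (a ValueError subclass) on invalid input.
        return json.loads(self)


raw = ToyText('\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n ')
data = raw.json()
print(data["totalProducts"])  # 3

try:
    ToyText("").json()
except ValueError as exc:
    # Empty or non-JSON text fails loudly instead of returning a partial result.
    print("not JSON:", exc)
```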
492
  The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
 
507
  The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
508
  So if you do something like this
509
  ```python
510
+ >>> page.css('div::text').get().json()
511
  ```
512
  You will get an error because the `div` tag doesn't have any direct text content that can be serialized to JSON; it doesn't have any direct text content at all.<br/><br/>
513
  In this case, the `get_all_text` method comes to the rescue, so you can do something like this
514
  ```python
515
+ >>> page.css('div')[0].get_all_text(ignore_tags=[]).json()
516
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
517
  ```
518
  I used the `ignore_tags` argument here because its default value is `('script', 'style',)`, as you are aware.<br/><br/>
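To see why the `ignore_tags` default matters, the sketch below mimics the idea with the standard library's `html.parser`. It is an assumption-laden illustration, not how Scrapling implements `get_all_text`: `TextCollector`, `all_text`, and the sample `html` string are all hypothetical.

```python
import json
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text from every tag except those listed in ignore_tags."""

    def __init__(self, ignore_tags=("script", "style")):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.skip_depth = 0  # greater than 0 while inside an ignored tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside ignored tags.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def all_text(html, ignore_tags=("script", "style")):
    parser = TextCollector(ignore_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page fragment: the only text lives inside a <script> tag.
html = '<div><script type="application/json">{"totalProducts": 3}</script></div>'

print(repr(all_text(html)))                         # '' (script text skipped)
print(json.loads(all_text(html, ignore_tags=[])))   # {'totalProducts': 3}
```

With the default tags ignored, the collected text is empty and cannot be parsed as JSON; passing an empty list keeps the script content so it can be deserialized.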