Karim shoair committed
Commit 72d7db6 · 1 Parent(s): e1908ab

docs: update BS article

docs/tutorials/migrating_from_beautifulsoup.md CHANGED
@@ -1,16 +1,16 @@
  # Migrating from BeautifulSoup to Scrapling
 
- If you're already familiar with BeautifulSoup, you're in for a treat. Scrapling is faster, provides similar parsing capabilities, and adds powerful new features for fetching and handling modern web pages. This guide will help you quickly adapt your existing BeautifulSoup code to take advantage of Scrapling's capabilities.
+ If you're already familiar with BeautifulSoup, you're in for a treat. Scrapling is incredibly faster, provides the same parsing capabilities, adds more parsing capabilities not found in BS, and introduces powerful new features for fetching and handling modern web pages. This guide will help you quickly adapt your existing BeautifulSoup code to leverage Scrapling's capabilities.
 
- Below is a table that covers the most common operations you'll perform when scraping web pages. Each row shows how to accomplish a specific task in BeautifulSoup and the corresponding way to do it in Scrapling.
+ Below is a table that covers the most common operations you'll perform when scraping web pages. Each row illustrates how to accomplish a specific task using BeautifulSoup and the corresponding method in Scrapling.
 
- You will notice some shortcuts in BeautifulSoup are missing in Scrapling, but that's one of the reasons that makes BeautifulSoup slower than Scrapling. The point is: If the same feature can be used in a short oneliner, there is no need to sacrifice performance to make that short line shorter :)
+ You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, but that's one of the reasons why BeautifulSoup is slower than Scrapling. The point is: If the same feature can be used in a short oneliner, there is no need to sacrifice performance to shorten that short line :)
 
 
  | Task | BeautifulSoup Code | Scrapling Code |
  |-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
- | Parser import | `from bs4 import BeautifulSoup` | `from scrapling.parser import Adaptor` |
- | Parsing HTML from string | `soup = BeautifulSoup(html, 'html.parser')` | `page = Adaptor(html)` |
+ | Parser import | `from bs4 import BeautifulSoup` | `from scrapling.parser import Selector` |
+ | Parsing HTML from string | `soup = BeautifulSoup(html, 'html.parser')` | `page = Selector(html)` |
  | Finding a single element | `element = soup.find('div', class_='example')` | `element = page.find('div', class_='example')` |
  | Finding multiple elements | `elements = soup.find_all('div', class_='example')` | `elements = page.find_all('div', class_='example')` |
  | Finding a single element (Example 2) | `element = soup.find('div', attrs={"class": "example"})` | `element = page.find('div', {"class": "example"})` |
@@ -26,7 +26,7 @@ You will notice some shortcuts in BeautifulSoup are missing in Scrapling, but th
  | Extracting text content of an element | `string = element.string` | `string = element.text` |
  | Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
  | Access the dictionary of attributes | `attrs = element.attrs` | `attrs = element.attrib` |
- | Extracting attributes | `attr = element['href']` | `attr = element.attrib['href']` |
+ | Extracting attributes | `attr = element['href']` | `attr = element['href']` |
  | Navigating to parent | `parent = element.parent` | `parent = element.parent` |
  | Get all parents of an element | `parents = list(element.parents)` | `parents = list(element.iterancestors())` |
  | Searching for an element in the parents of an element | `target_parent = element.find_parent("a")` | `target_parent = element.find_ancestor(lambda p: p.tag == 'a')` |
@@ -44,7 +44,7 @@ You will notice some shortcuts in BeautifulSoup are missing in Scrapling, but th
  | Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
 
 
- One point to remember: BeautifulSoup provides features for modifying and manipulating the page after parsing it. Scrapling focuses more on Scraping the page faster for you, and then you can do what you want with the extracted information. So, two different tools can be used in Web SScraping, but one of them specializes in Web Scraping :)
+ **One key point to remember**: BeautifulSoup offers features for modifying and manipulating the page after it has been parsed. Scrapling focuses more on scraping the page faster for you, and then you can do what you want with the extracted information. So, two different tools can be used in Web Scraping, but one of them specializes in Web Scraping :)
 
  ### Putting It All Together
 
@@ -71,7 +71,7 @@ for link in links:
  from scrapling import Fetcher
 
  url = 'http://example.com'
- page = Fetcher.get(url=url)
+ page = Fetcher.get(url)
 
  links = page.css('a::attr(href)')
  for link in links:
@@ -83,10 +83,10 @@ As you can see, Scrapling simplifies the process by handling the fetching and pa
  **Additional Notes:**
 
  - **Different parsers**: BeautifulSoup allows you to set the parser engine to use, and one of them is `lxml`. Scrapling doesn't do that and uses the `lxml` library by default for performance reasons.
- - **Element Types**: In BeautifulSoup, elements are `Tag` objects, while in Scrapling, they are `Adaptor` objects. However, they provide similar methods and properties for navigation and data extraction.
- - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.css_first()`). To avoid errors, Check for `None` before accessing properties.
- - **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can be helpful for removing extra whitespace or unwanted characters. Please check out the documentation for the complete list.
+ - **Element Types**: In BeautifulSoup, elements are `Tag` objects, while in Scrapling, they are `Selector` objects. However, they provide similar methods and properties for navigation and data extraction.
+ - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.css_first()`). To avoid errors, check for `None` before accessing properties.
+ - **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can help remove extra whitespace, consecutive spaces, or unwanted characters. Please check out the documentation for the complete list.
 
- The documentation provides more details on Scrapling's features and the full list of arguments that can be passed to all methods.
+ The documentation provides more details on Scrapling's features and the complete list of arguments that can be passed to all methods.
 
  This guide should make your transition from BeautifulSoup to Scrapling smooth and straightforward. Happy scraping!