Karim shoair committed on
Commit
c5df5f6
·
1 Parent(s): 72d73e1

docs: removing old documentation


Check out the website at https://scrapling.readthedocs.io

docs/Core/using scrapling custom types.md DELETED
@@ -1,21 +0,0 @@
- > You can take advantage of the custom-made types in Scrapling and use them outside the library if you want. It's better than copying their code, after all :)
-
- ### All current types can be imported on their own, as below
- ```python
- >>> from scrapling.core.custom_types import TextHandler, AttributesHandler
-
- >>> somestring = TextHandler('{}')
- >>> somestring.json()
- '{}'
- >>> somedict_1 = AttributesHandler({'a': 1})
- >>> somedict_2 = AttributesHandler(a=1)
- ```
-
- Note that `TextHandler` is a subclass of Python's `str`, so all the normal operations/methods that work with Python strings will work with it.
- If you want to check for the type in your code, it's better to rely on Python's built-in `isinstance`/`issubclass` functions (since `TextHandler` subclasses `str`, `isinstance(somestring, str)` is `True`).
-
- The class `AttributesHandler` is a subclass of `collections.abc.Mapping`, so it's immutable (read-only) and all its operations are inherited from that class. The data passed in can be accessed later through the `._data` attribute, but be careful: it's of type `types.MappingProxyType`, so it's immutable (read-only) as well (and slightly faster than going through `collections.abc.Mapping`).
-
- Simply put, if you are new to Python: all the operations and methods of Python's standard `dict` type will work with `AttributesHandler`, except the ones that try to modify the underlying data.
-
- If you want to modify the data inside an `AttributesHandler`, you have to convert it to a dictionary first, e.g. with the `dict()` function, and modify the copy.
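The read-only mapping pattern described above can be reproduced with the standard library alone. The sketch below is a stand-in for illustration, not Scrapling's actual `AttributesHandler` implementation; the class name `ReadOnlyAttributes` is invented for this example.

```python
from collections.abc import Mapping
from types import MappingProxyType


class ReadOnlyAttributes(Mapping):
    """Minimal stand-in for the immutable-mapping pattern described above."""

    def __init__(self, mapping=None, **kwargs):
        # Wrap the data in a MappingProxyType so it's read-only internally too
        self._data = MappingProxyType(dict(mapping or {}, **kwargs))

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


attrs = ReadOnlyAttributes({'class': 'title'}, id='main')
print(attrs['id'])        # dict-style reads work
print('class' in attrs)   # so do membership tests

# Writing fails, because Mapping defines no __setitem__:
try:
    attrs['id'] = 'other'
except TypeError as e:
    print('read-only:', e)

# To modify, copy into a plain dict first and change the copy:
mutable = dict(attrs)
mutable['id'] = 'other'
```

Everything `Mapping` provides (`.get()`, `.keys()`, `.items()`, iteration, equality) comes for free from the three dunder methods; only the mutating `dict` operations are missing.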
 
docs/Examples/selectorless_stackoverflow.py DELETED
@@ -1,25 +0,0 @@
- """
- I made this example only to show how Scrapling's features can be used to scrape a website without writing any selectors,
- so this script doesn't depend on the website's structure.
- """
-
- import requests
-
- from scrapling import Adaptor
-
- response = requests.get('https://stackoverflow.com/questions/tagged/web-scraping?sort=MostVotes&filters=NoAcceptedAnswer&edited=true&pagesize=50&page=2')
- page = Adaptor(response.text, url=response.url)
- # First, extract the first question's title and author based on their text content
- first_question_title = page.find_by_text('Run Selenium Python Script on Remote Server')
- first_question_author = page.find_by_text('Ryan')
- # Guard against misses, because this page changes a lot
- if first_question_title and first_question_author:
-     # If you want, you can climb to the first question's container element like below
-     first_question = first_question_title.find_ancestor(
-         lambda ancestor: ancestor.attrib.get('id') and 'question-summary' in ancestor.attrib.get('id')
-     )
-     rest_of_questions = first_question.find_similar()
-     # Since there is nothing to rely on to extract the other titles/authors without CSS/XPath selectors,
-     # we get the rest of the page's titles/authors from elements structurally similar
-     # to the first title and the first author found above as a starting point
-     for i, (title, author) in enumerate(zip(first_question_title.find_similar(), first_question_author.find_similar()), start=1):
-         print(i, title.text, author.text)
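The core idea behind `find_similar` above (treat elements as "similar" when they occupy the same structural position) can be illustrated with the standard library alone. This is a concept sketch, not Scrapling's actual algorithm; the `PathCollector` class and the sample HTML are invented for this example.

```python
# Stdlib-only sketch: elements are "similar" when they share the same ancestor tag path.
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record the ancestor tag path and text of every element in a tiny page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # current open-tag path
        self.texts = {}   # path tuple -> list of text snippets found at that path

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.setdefault(tuple(self.stack), []).append(text)


HTML = """
<div class="list">
  <div class="q"><h3><a>First question</a></h3></div>
  <div class="q"><h3><a>Second question</a></h3></div>
  <div class="q"><h3><a>Third question</a></h3></div>
</div>
"""

parser = PathCollector()
parser.feed(HTML)

# Locate the path of one known seed text; every text at that path is "similar"
seed_path = next(p for p, texts in parser.texts.items() if 'First question' in texts)
print(parser.texts[seed_path])  # all three titles share the seed's structure
```

The same principle lets the script above find every question title on the page from a single known title, without ever writing a selector.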
 
docs/Extending Scrapling/writing storage system.md DELETED
@@ -1,17 +0,0 @@
- Scrapling uses SQLite by default, but in case you want to write your own storage system to store element properties for the auto-matching, this tutorial has you covered.
-
- You might want to use Firebase, for example, and share the database between multiple spiders on different machines. An online database like that is a great idea because this way the spiders share data with each other.
-
- First, for your storage class to work, it must do these three things:
- 1. Inherit from the abstract class `scrapling.storage_adaptors.StorageSystemMixin` and accept a string argument, which will be the `url` argument, to maintain the library's logic.
- 2. Use the `functools.lru_cache` decorator on top of the class itself to follow the Singleton design pattern, as the other classes do.
- 3. Implement the methods `save` and `retrieve`, as you can see from the type hints:
-     - The method `save` returns nothing and receives two arguments from the library:
-         * The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the function `scrapling.utils._StorageTools.element_to_dict` so we keep the same format, then saved to your database as you wish.
-         * The second one is a string: the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the auto-matching will break.
-     - The method `retrieve` takes a string, the identifier. Using it together with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists; otherwise it returns `None`.
- > If these instructions aren't clear enough, you can check my SQLite3 implementation in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py) file.
-
- If your class satisfies this, the rest is easy. If you are planning to use the library in a threaded application, make sure your class supports that; the default class is thread-safe.
-
- There are some helper functions in the abstract class if you want to use them. It's easier to see for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py); it's heavily commented :)
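To illustrate the shape of such a class, here is a stdlib-only sketch of an in-memory backend. It is an assumption-laden stand-in, not a working Scrapling adaptor: it does not inherit the real `StorageSystemMixin`, it skips the `element_to_dict` conversion (taking a plain dict directly), and the class name is invented for this example.

```python
import threading
from functools import lru_cache


@lru_cache(maxsize=None)  # one cached instance per unique url -> Singleton pattern
class InMemoryStorageSystem:
    """Illustrative in-memory stand-in for a Scrapling storage backend.

    A real implementation must inherit scrapling.storage_adaptors.StorageSystemMixin
    and convert elements with scrapling.utils._StorageTools.element_to_dict.
    """

    def __init__(self, url: str):
        self.url = url
        self._lock = threading.Lock()    # keep it safe for threaded spiders
        self._rows = {}                  # identifier -> element dict

    def save(self, element_dict: dict, identifier: str) -> None:
        # (url, identifier) must be unique per row; url is fixed per cached instance,
        # so the identifier alone keys this instance's rows
        with self._lock:
            self._rows[identifier] = element_dict

    def retrieve(self, identifier: str):
        with self._lock:
            return self._rows.get(identifier)  # None when nothing was stored


storage = InMemoryStorageSystem('https://example.com')
storage.save({'tag': 'div', 'attributes': {'id': 'price'}}, identifier='price-box')
print(storage.retrieve('price-box'))
print(storage.retrieve('missing'))  # -> None
```

Because `lru_cache` wraps the class, calling `InMemoryStorageSystem('https://example.com')` again returns the same cached instance, which is how the Singleton-per-url behavior falls out of point 2 above.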
 
docs/index.md DELETED
@@ -1,2 +0,0 @@
- # This section is still under construction, but any help is highly appreciated
- ## I will try to make full, detailed documentation with Sphinx ASAP.