Karim Shoair committed on
Commit · c5df5f6
Parent(s): 72d73e1
docs: removing old documentation
Check out the website at https://scrapling.readthedocs.io

docs/Core/using scrapling custom types.md
DELETED
@@ -1,21 +0,0 @@
> You can take advantage of the custom-made types in Scrapling and use them outside the library if you want. It's better than copying their code, after all :)

### All current types can be imported alone like below
```python
>>> from scrapling.core.custom_types import TextHandler, AttributesHandler

>>> somestring = TextHandler('{}')
>>> somestring.json()
'{}'
>>> somedict_1 = AttributesHandler({'a': 1})
>>> somedict_2 = AttributesHandler(a=1)
```

Note that `TextHandler` is a sub-class of Python's `str`, so all the normal operations/methods that work with Python strings will work with it.
If you want to check for this type in your code, it's better to rely on Python's built-in function `issubclass`.
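To illustrate the point, here is a plain `str` subclass standing in for `TextHandler` (a hypothetical stand-in for demonstration, not Scrapling's actual class):

```python
# Hypothetical stand-in for TextHandler: any subclass of `str`
class TextLike(str):
    pass

s = TextLike("hello")

# Normal string operations still work on the subclass
print(s.upper())                 # HELLO

# Type checks against `str` still pass for the subclass
print(issubclass(type(s), str))  # True
print(isinstance(s, str))        # True
```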

The class `AttributesHandler` is a sub-class of `collections.abc.Mapping`, so it's immutable (read-only) and inherits all its operations from it. The data passed in can be accessed later through the `._data` attribute, but be careful: it's of type `types.MappingProxyType`, so it's read-only as well (and slightly faster than going through `collections.abc.Mapping`).

So, to put it simply if you are new to Python: the same operations and methods from Python's standard `dict` type will all work with the `AttributesHandler` class, except the ones that try to modify the actual data.

If you want to modify the data inside an `AttributesHandler`, you have to convert it to a dictionary first (e.g. using the `dict` function) and modify the copy outside.
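The read-only behavior described above can be sketched with standard-library pieces alone. This is an illustrative stand-in (the class name and internals here are made up), not Scrapling's actual implementation:

```python
from collections.abc import Mapping
from types import MappingProxyType

class ReadOnlyAttrs(Mapping):
    """Hypothetical stand-in for AttributesHandler: an immutable mapping."""
    def __init__(self, mapping=None, **kwargs):
        # MappingProxyType makes the underlying dict read-only
        self._data = MappingProxyType(dict(mapping or {}, **kwargs))

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

attrs = ReadOnlyAttrs({'class': 'title'}, id='main')

# Read operations work like a normal dict
print(attrs['class'])
print('id' in attrs, len(attrs))

# Writing raises TypeError because MappingProxyType is read-only
try:
    attrs._data['class'] = 'new'
except TypeError as exc:
    print('read-only:', exc)

# To modify, convert to a plain dict first and edit the copy
editable = dict(attrs)
editable['class'] = 'new'
print(editable)
```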
docs/Examples/selectorless_stackoverflow.py
DELETED
@@ -1,25 +0,0 @@
"""
I only made this example to show how Scrapling's features can be used to scrape a website without writing any selectors,
so this script doesn't depend on the website's structure.
"""

import requests

from scrapling import Adaptor

response = requests.get('https://stackoverflow.com/questions/tagged/web-scraping?sort=MostVotes&filters=NoAcceptedAnswer&edited=true&pagesize=50&page=2')
page = Adaptor(response.text, url=response.url)
# First, we extract the first question's title and its author based on their text content
first_question_title = page.find_by_text('Run Selenium Python Script on Remote Server')
first_question_author = page.find_by_text('Ryan')
# Check that both were found, because this page changes a lot
if first_question_title and first_question_author:
    # If you want, you can extract the questions' containers like below
    first_question = first_question_title.find_ancestor(
        lambda ancestor: ancestor.attrib.get('id') and 'question-summary' in ancestor.attrib.get('id')
    )
    rest_of_questions = first_question.find_similar()
    # But since there is nothing to rely on to extract the other titles/authors from these elements without CSS/XPath selectors,
    # we get the rest of the titles/authors on the page using the first title and the first author above as starting points
    for i, (title, author) in enumerate(zip(first_question_title.find_similar(), first_question_author.find_similar()), start=1):
        print(i, title.text, author.text)
docs/Extending Scrapling/writing storage system.md
DELETED
@@ -1,17 +0,0 @@
Scrapling uses SQLite by default, but in case you want to write your own storage system to store elements' properties for the auto-matching, this tutorial has you covered.

You might want to use Firebase, for example, and share the database between multiple spiders on different machines. Using an online database like that is a great idea, because this way the spiders share data with each other.

So first, for your storage class to work, it must do the big three:
1. Inherit from the abstract class `scrapling.storage_adaptors.StorageSystemMixin` and accept a string argument which will be the `url` argument, to maintain the library's logic.
2. Use the decorator `functools.lru_cache` on top of the class itself, to follow the Singleton design pattern like the other classes do.
3. Implement the methods `save` and `retrieve`, as you see from the type hints:
    - The method `save` returns nothing and gets two arguments from the library:
        * The first one is of type `lxml.html.HtmlElement`, which is the element itself, of course. It must be converted to a dictionary using the function `scrapling.utils._StorageTools.element_to_dict` so we keep the same format, then saved to your database as you wish.
        * The second one is a string, the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the auto-matching will break.
    - The method `retrieve` takes the identifier string; using it together with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists, otherwise `None` is returned.

> If the instructions weren't clear enough for you, you can check my SQLite3 implementation in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py) file.
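Putting the three requirements together, a minimal sketch might look like the following. Note that the stub base class and the in-memory dict are stand-ins for illustration only; in real code you would inherit the actual `scrapling.storage_adaptors.StorageSystemMixin`, convert elements with `scrapling.utils._StorageTools.element_to_dict`, and write to a real database:

```python
from functools import lru_cache

# Stub standing in for scrapling.storage_adaptors.StorageSystemMixin;
# the real abstract class is imported from Scrapling instead.
class StorageSystemMixin:
    def __init__(self, url):
        self.url = url

@lru_cache(None)  # one cached instance per `url` -> Singleton pattern
class InMemoryStorageSystem(StorageSystemMixin):
    """Hypothetical storage backend keeping rows in a plain dict."""
    def __init__(self, url):
        super().__init__(url)
        self._db = {}

    def save(self, element_dict, identifier):
        # In a real adaptor, the element is an lxml.html.HtmlElement that
        # must be converted with _StorageTools.element_to_dict first.
        # (url, identifier) must be unique per row for auto-matching.
        self._db[(self.url, identifier)] = element_dict

    def retrieve(self, identifier):
        # Returns the stored dictionary, or None if it doesn't exist
        return self._db.get((self.url, identifier))

storage = InMemoryStorageSystem('https://example.com')
storage.save({'tag': 'div', 'attributes': {'id': 'post'}}, 'post-1')
print(storage.retrieve('post-1'))
print(storage.retrieve('missing'))  # None

# Thanks to lru_cache, the same URL yields the same instance
assert storage is InMemoryStorageSystem('https://example.com')
```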

If your class satisfies all of this, the rest is easy. If you are planning to use the library in a threaded application, make sure your class supports that; the default class is thread-safe.

There are some helper functions on the abstract class if you want to use them. It's easier to see them for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py); it's heavily commented :)
docs/index.md
DELETED
@@ -1,2 +0,0 @@
# This section is still under work, but any help is highly appreciated
## I will try to make full, detailed documentation with Sphinx ASAP.