Karim shoair committed on Aug 29, 2025

Commit

4d1a1ad

1 Parent(s): ccf41bc

docs: Add a page about the interactive shell

Files changed (1)

docs/cli/interactive-shell.md +235 -0

docs/cli/interactive-shell.md ADDED

	@@ -0,0 +1,235 @@

+# Scrapling Interactive Shell Guide
+<script src="https://asciinema.org/a/736339.js" id="asciicast-736339" async data-autoplay="1" data-loop="1" data-cols="225" data-rows="40" data-start-at="00:06" data-speed="1.5"></script>
+**Powerful Web Scraping REPL for Developers and Data Scientists**
+The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for web scraping tasks. It provides instant access to all Scrapling features, convenient shortcuts, automatic page management, and advanced tools such as curl command conversion.
+## Why use the Interactive Shell?
+The interactive shell transforms web scraping from a slow script-and-run cycle into a fast, exploratory experience. It's perfect for:
+- **Rapid prototyping**: Test scraping strategies instantly
+- **Data exploration**: Interactively navigate and extract from websites
+- **Learning Scrapling**: Experiment with features in real-time
+- **Debugging scrapers**: Step through requests and inspect results
+- **Converting workflows**: Turn curl commands copied from browser DevTools into `Fetcher` requests with a single call
+## Getting Started
+### Launch the Shell
+```bash
+# Start the interactive shell
+scrapling shell
+# Execute code and exit (useful for scripting)
+scrapling shell -c "get('https://quotes.toscrape.com'); print(len(page.css('.quote')))"
+# Set logging level
+scrapling shell --loglevel info
+```
+Once launched, you'll see the Scrapling banner and can start scraping immediately, as the video above shows:
+```python
+# No imports needed - everything is ready!
+>>> get('https://news.ycombinator.com')
+>>> # Explore the page structure
+>>> page.css('a')[:5]  # Look at first 5 links
+>>> # Refine your selectors
+>>> stories = page.css('.titleline>a')
+>>> len(stories)
+30
+>>> # Extract specific data
+>>> for story in stories[:3]:
+...     title = story.text
+...     url = story['href']
+...     print(f"{title}: {url}")
+>>> # Try different approaches
+>>> titles = page.css('.titleline>a::text')  # Direct text extraction
+>>> urls = page.css('.titleline>a::attr(href)')  # Direct attribute extraction
+```
+## Built-in Shortcuts
+The shell provides convenient shortcuts that eliminate boilerplate code:
+- **`get(url, **kwargs)`** - HTTP GET request (instead of `Fetcher.get`)
+- **`post(url, **kwargs)`** - HTTP POST request (instead of `Fetcher.post`)
+- **`put(url, **kwargs)`** - HTTP PUT request (instead of `Fetcher.put`)
+- **`delete(url, **kwargs)`** - HTTP DELETE request (instead of `Fetcher.delete`)
+- **`fetch(url, **kwargs)`** - Browser-based fetch (instead of `DynamicFetcher.fetch`)
+- **`stealthy_fetch(url, **kwargs)`** - Stealthy browser fetch (instead of `StealthyFetcher.fetch`)
+The most commonly used classes are automatically available without any import, including `Fetcher`, `AsyncFetcher`, `DynamicFetcher`, `StealthyFetcher`, and `Selector`.
+### Smart Page Management
+The shell automatically tracks your requests and pages:
+- **Current Page Access**
+    The `page` and `response` variables are automatically updated to point to the last fetched page:
+    ```python
+    >>> get('https://quotes.toscrape.com')
+    >>> # 'page' and 'response' both refer to the last fetched page
+    >>> page.url
+    'https://quotes.toscrape.com'
+    >>> response.status  # Same as page.status
+    200
+    ```
+- **Page History**
+    The `pages` variable keeps the last five fetched pages (it's a `Selectors` object):
+    ```python
+    >>> get('https://site1.com')
+    >>> get('https://site2.com')
+    >>> get('https://site3.com')
+    >>> # Access last 5 pages
+    >>> len(pages)  # `Selectors` object with `page` history
+    3
+    >>> pages[0].url  # First page in history
+    'https://site1.com'
+    >>> pages[-1].url  # Most recent page
+    'https://site3.com'
+    >>> # Work with historical pages
+    >>> for i, old_page in enumerate(pages):
+    ...     print(f"Page {i}: {old_page.url} - {old_page.status}")
+    ```
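The five-page limit described above behaves like a bounded buffer: once a sixth page is fetched, the oldest entry drops off. The following is a minimal sketch of that behavior using only the standard library. It is not Scrapling's implementation, and `FakePage`, `history`, and `record` are hypothetical names used purely for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a fetched page (not a Scrapling class)."""
    url: str
    status: int = 200


# Bounded history: appending a sixth item silently evicts the oldest one.
history = deque(maxlen=5)


def record(url: str) -> FakePage:
    """Pretend to fetch `url`, store the result in the history, and return it."""
    page = FakePage(url)
    history.append(page)
    return page


for n in range(1, 8):  # simulate seven fetches
    record(f"https://site{n}.com")

print(len(history))     # 5
print(history[0].url)   # https://site3.com
print(history[-1].url)  # https://site7.com
```

After seven simulated fetches only the five most recent remain, so index `0` is the oldest page still kept, not the first page ever fetched.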
+## Additional Helpful Commands
+### Page Visualization
+View scraped pages in your browser:
+```python
+>>> get('https://quotes.toscrape.com')
+>>> view(page)  # Opens the page HTML in your default browser
+```
+### Curl Command Integration
+The shell provides two functions, `uncurl` and `curl2fetcher`, that convert curl commands copied from the browser DevTools into `Fetcher` requests. First, copy a request as a curl command, as in the following screenshot:
+<img src="../../assets/scrapling_shell_curl.png" title="Copying a request as a curl command from Chrome" alt="Copying a request as a curl command from Chrome" style="width: 70%;"/>
+- **Convert Curl command to Request Object**
+    ```python
+    >>> curl_cmd = '''curl 'https://httpbin.org/post' \
+    ...   -X POST \
+    ...   -H 'Content-Type: application/json' \
+    ...   -d '{"name": "test", "value": 123}' '''
+    >>> request = uncurl(curl_cmd)
+    >>> request.method
+    'post'
+    >>> request.url
+    'https://httpbin.org/post'
+    >>> request.headers
+    {'Content-Type': 'application/json'}
+    ```
+- **Execute Curl Command Directly**
+    ```python
+    >>> # Convert and execute in one step
+    >>> curl2fetcher(curl_cmd)
+    >>> page.status
+    200
+    >>> page.json()['json']
+    {'name': 'test', 'value': 123}
+    ```
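To make the conversion idea concrete, here is a rough sketch of how a curl command can be broken into method, URL, headers, and body using only the standard library. It is an illustration under stated assumptions, not how `uncurl` is implemented: `parse_curl` is a hypothetical helper, and it understands only the `-X`, `-H`, and `-d` flags.

```python
import argparse
import json
import shlex


def parse_curl(command: str) -> dict:
    """Parse a small subset of curl flags (-X, -H, -d) into a plain dict."""
    tokens = shlex.split(command)  # shell-style splitting that respects quotes
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("url")
    parser.add_argument("-X", "--request", dest="method")
    parser.add_argument("-H", "--header", dest="headers", action="append", default=[])
    parser.add_argument("-d", "--data", dest="data")
    # Flags outside this subset are collected separately and ignored here.
    args, _ignored = parser.parse_known_args(tokens[1:])

    headers = {}
    for raw in args.headers:
        name, _, value = raw.partition(":")
        headers[name.strip()] = value.strip()

    # curl switches to POST when a body is given without an explicit -X.
    method = args.method or ("POST" if args.data else "GET")
    return {"method": method.lower(), "url": args.url, "headers": headers, "data": args.data}


command = (
    "curl 'https://httpbin.org/post' "
    "-X POST "
    "-H 'Content-Type: application/json' "
    "-d '{\"name\": \"test\", \"value\": 123}'"
)

request = parse_curl(command)
print(request["method"])            # post
print(request["url"])               # https://httpbin.org/post
print(request["headers"])           # {'Content-Type': 'application/json'}
print(json.loads(request["data"]))  # {'name': 'test', 'value': 123}
```

Commands copied from DevTools usually carry many more flags (cookies, compression, and so on), which is why the built-in helpers are the better choice in practice.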
+### IPython Features
+The shell inherits all IPython capabilities:
+```python
+>>> # Magic commands
+>>> %time page = get('https://example.com')  # Time execution
+>>> %history  # Show command history
+>>> %save filename.py 1-10  # Save commands 1-10 to file
+>>> # Tab completion works everywhere
+>>> page.c<TAB>  # Shows: css, css_first, cookies, etc.
+>>> Fetcher.<TAB>  # Shows all Fetcher methods
+>>> # Object inspection
+>>> get? # Show get documentation
+```
+## Examples
+Here are a few examples generated via AI:
+### E-commerce Data Collection
+```python
+>>> # Start with product listing page
+>>> catalog = get('https://shop.example.com/products')
+>>> # Find product links
+>>> product_links = catalog.css('.product-link::attr(href)')
+>>> print(f"Found {len(product_links)} products")
+>>> # Sample a few products first
+>>> for link in product_links[:3]:
+...     product = get(f"https://shop.example.com{link}")
+...     name = product.css('.product-name::text').get('')
+...     price = product.css('.price::text').get('')
+...     print(f"{name}: {price}")
+>>> # Scale up with sessions for efficiency
+>>> from scrapling.fetchers import FetcherSession
+>>> with FetcherSession() as session:
+...     products = []
+...     for link in product_links:
+...         product = session.get(f"https://shop.example.com{link}")
+...         products.append({
+...             'name': product.css('.product-name::text').get(''),
+...             'price': product.css('.price::text').get(''),
+...             'url': link
+...         })
+```
+### API Integration and Testing
+```python
+>>> # Test API endpoints interactively
+>>> response = get('https://jsonplaceholder.typicode.com/posts/1')
+>>> response.json()
+{'userId': 1, 'id': 1, 'title': 'sunt aut...', 'body': 'quia et...'}
+>>> # Test POST requests
+>>> new_post = post('https://jsonplaceholder.typicode.com/posts',
+...                 json={'title': 'Test Post', 'body': 'Test content', 'userId': 1})
+>>> new_post.json()['id']
+101
+>>> # Test with different data
+>>> updated = put(f'https://jsonplaceholder.typicode.com/posts/{new_post.json()["id"]}',
+...               json={'title': 'Updated Title'})
+```
+## Getting Help
+If you need help beyond what is available in the terminal, see:
+- [Scrapling Documentation](https://scrapling.readthedocs.io/)
+- [Discord Community](https://discord.gg/EMgGbDceNQ)
+- [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues)
+And that's it! Happy scraping! The shell makes web scraping as easy as a conversation.