Spaces:
Paused
Paused
| # `scrapeURL` | |
| New URL scraper for Firecrawl | |
| ## Signal flow | |
| ```mermaid | |
| flowchart TD; | |
| scrapeURL-.->buildFallbackList; | |
| buildFallbackList-.->scrapeURLWithEngine; | |
| scrapeURLWithEngine-.->parseMarkdown; | |
| parseMarkdown-.->wasScrapeSuccessful{{Was scrape successful?}}; | |
| wasScrapeSuccessful-."No".->areEnginesLeft{{Are there engines left to try?}}; | |
| areEnginesLeft-."Yes, try next engine".->scrapeURLWithEngine; | |
| areEnginesLeft-."No".->NoEnginesLeftError[/NoEnginesLeftError/] | |
| wasScrapeSuccessful-."Yes".->asd; | |
| ``` | |
| ## Differences from `WebScraperDataProvider` | |
| - The job of `WebScraperDataProvider.validateInitialUrl` has been delegated to the zod layer above `scrapeUrl`. | |
| - `WebScraperDataProvider.mode` has no equivalent, only `scrape_url` is supported. | |
| - You may no longer specify multiple URLs. | |
| - Built on `v1` definitons, instead of `v0`. | |
| - PDFs are now converted straight to markdown using LlamaParse, instead of converting to just plaintext. | |
| - DOCXs are now converted straight to HTML (and then later to markdown) using mammoth, instead of converting to just plaintext. | |
| - Using new JSON Schema OpenAI API -- schema fails with LLM Extract will be basically non-existant. | |