---
title: web-reader
emoji: 🐳
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8080
license: mit
short_description: Generate any application mdx with web reader
---
This Space is Docker-based and will be built using the repository's `Dockerfile`. It uses TypeScript/Node; see `Dockerfile` and `package.json` for build/runtime details.
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Behind the scenes, Reader searches the web, fetches the top 5 results, visits each URL, and applies `r.jina.ai` to it. This is different from many `web search function-calling` in agent/RAG frameworks, which often return only the title, URL, and description provided by the search engine API. If you want to read one result more deeply, you have to fetch the content yourself from that URL. With Reader, `http://s.jina.ai` automatically fetches the content from the top 5 search result URLs for you (reusing the tech stack behind `http://r.jina.ai`). This means you don't have to handle browser rendering, blocking, or any issues related to JavaScript and CSS yourself.
### Using `s.jina.ai` for in-site search
Simply specify `site` in the query parameters such as:
```bash
curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
```
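The same URL can be assembled programmatically. The following is a minimal sketch, assuming only what the `curl` example shows: the query is percent-encoded into the path and `site` is repeated once per domain. The helper name `build_search_url` is hypothetical and not part of any Reader SDK.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://s.jina.ai/"


def build_search_url(query: str, sites: Sequence[str] = ()) -> str:
    """Return an s.jina.ai URL for `query`, optionally restricted to `sites`."""
    # Percent-encode the whole query so it forms a single path segment.
    url = SEARCH_ENDPOINT + quote(query, safe="")
    if sites:
        # Repeat the `site` query parameter once per domain.
        url += "?" + urlencode([("site", site) for site in sites])
    return url
```

Calling `build_search_url("When was Jina AI founded?", ["jina.ai", "github.com"])` yields the same URL as the `curl` command above.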
### [Interactive Code Snippet Builder](https://jina.ai/reader#apiform)
We highly recommend using the code builder to explore different parameter combinations of the Reader API.
<a href="https://jina.ai/reader#apiform"><img width="973" alt="image" src="https://github.com/jina-ai/reader/assets/2041322/a490fd3a-1c4c-4a3f-a95a-c481c2a8cc8f"></a>
### Using request headers
You can control the behavior of the Reader API using request headers. The supported headers include:
- You can enable the image caption feature via the `x-with-generated-alt: true` header.
- You can ask the Reader API to forward cookie settings via the `x-set-cookie` header.
  - Note that requests with cookies will not be cached.
- You can bypass `readability` filtering via the `x-respond-with` header, specifically:
  - `x-respond-with: markdown` returns markdown *without* going through `readability`
  - `x-respond-with: html` returns `documentElement.outerHTML`
  - `x-respond-with: text` returns `document.body.innerText`
  - `x-respond-with: screenshot` returns the URL of the webpage's screenshot
- You can specify a proxy server via the `x-proxy-url` header.
- You can customize cache tolerance via the `x-cache-tolerance` header (integer in seconds).
- You can bypass the cached page (lifetime 3600s) via the `x-no-cache: true` header (equivalent of `x-cache-tolerance: 0`).
- If you already know the HTML structure of your target page, you may specify `x-target-selector` or `x-wait-for-selector` to direct the Reader API to focus on a specific part of the page.
  - By setting the `x-target-selector` header to a CSS selector, the Reader API returns the content within the matched element instead of the full HTML. This is useful when the automatic content extraction fails to capture the desired content and you can manually select the correct target.
  - By setting the `x-wait-for-selector` header to a CSS selector, the Reader API will wait until the matched element is rendered before returning the content. If you already specified `x-target-selector`, this header can be omitted if you plan to wait for the same element.
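As an illustration, the headers listed above can be collected by a small helper before a request is sent. This is a sketch, not an official client: `reader_headers` and `build_reader_request` are hypothetical names, only a subset of the headers is covered, and the request is built but never sent.

```python
from typing import Dict, Optional
from urllib.request import Request

READER_ENDPOINT = "https://r.jina.ai/"
RESPOND_WITH_VALUES = {"markdown", "html", "text", "screenshot"}


def reader_headers(
    *,
    respond_with: Optional[str] = None,
    generated_alt: bool = False,
    no_cache: bool = False,
    cache_tolerance: Optional[int] = None,
    target_selector: Optional[str] = None,
    wait_for_selector: Optional[str] = None,
    proxy_url: Optional[str] = None,
) -> Dict[str, str]:
    """Translate keyword options into the request headers listed above."""
    headers: Dict[str, str] = {}
    if respond_with is not None:
        if respond_with not in RESPOND_WITH_VALUES:
            raise ValueError(f"unsupported x-respond-with value: {respond_with!r}")
        headers["x-respond-with"] = respond_with
    if generated_alt:
        headers["x-with-generated-alt"] = "true"
    if no_cache:
        headers["x-no-cache"] = "true"
    if cache_tolerance is not None:
        # The header takes an integer number of seconds.
        headers["x-cache-tolerance"] = str(int(cache_tolerance))
    if target_selector:
        headers["x-target-selector"] = target_selector
    if wait_for_selector:
        headers["x-wait-for-selector"] = wait_for_selector
    if proxy_url:
        headers["x-proxy-url"] = proxy_url
    return headers


def build_reader_request(target_url: str, **options) -> Request:
    """Build (but do not send) a GET request for `target_url` through Reader."""
    return Request(READER_ENDPOINT + target_url, headers=reader_headers(**options))
```

The resulting object can be passed to `urllib.request.urlopen` to perform the request, for example `urlopen(build_reader_request("https://example.com/", respond_with="text"))`.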
### Using `r.jina.ai` for single page application (SPA) fetching
Many websites nowadays rely on JavaScript frameworks and client-side rendering; these are usually known as Single Page Applications (SPAs). Thanks to [Puppeteer](https://github.com/puppeteer/puppeteer) and headless Chrome, Reader natively supports fetching these websites. However, because of the way some SPAs are built, you may need to take a few extra precautions.
#### SPAs with hash-based routing
By web standards, the part of a URL that comes after `#` is not sent to the server. To work around this, use the `POST` method with the `url` parameter in the body.
```bash
curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route'
```
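The sketch below shows why the fragment is lost in a plain `GET` and how the equivalent `POST` can be built. It assumes the form-encoded `url` field from the `curl` example; `build_post_request` is a hypothetical helper name, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

READER_ENDPOINT = "https://r.jina.ai/"


def build_post_request(target_url: str) -> Request:
    """Build a POST request that carries `target_url`, fragment included, in the body."""
    # Form-encode the URL so that `#` and `/` survive as part of the field value.
    body = urlencode({"url": target_url}).encode("ascii")
    return Request(
        READER_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# The route lives entirely in the fragment, which clients do not send to the server.
parts = urlsplit("https://example.com/#/route")
print(parts.path, parts.fragment)
```

Because the target URL travels in the request body rather than in the request line, the `#/route` part reaches Reader intact.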
#### SPAs with preloading contents
Some SPAs, and even some websites that are not strictly SPAs, show preload content first and load the main content dynamically afterwards. In this case, Reader may capture the preload content instead of the main content. Here are some possible solutions:
##### Specifying `x-timeout`
When a timeout is explicitly specified, Reader will not attempt to return early and will wait for network idle until the timeout is reached. This is useful when the target website eventually reaches network idle.
```bash
curl 'https://r.jina.ai/https://example.com/' -H 'x-timeout: 30'
```
##### Specifying `x-wait-for-selector`
When wait-for-selector is explicitly specified, Reader will wait for the appearance of the specified CSS selector until timeout is reached. This is useful when you know exactly what element to wait for.
```bash
curl 'https://r.jina.ai/https://example.com/' -H 'x-wait-for-selector: #content'
```
### Streaming mode
Streaming mode is useful when you find that the standard mode provides an incomplete result. In streaming mode, Reader waits a bit longer, until the page is *stably* rendered. Use the `Accept` header to toggle streaming mode:
```bash
curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
The data comes in a stream; each subsequent chunk contains more complete information. **The last chunk should provide the most complete and final result.** If you are used to LLM text-generation streaming, please note that this behaves differently: a chunk is a more complete version of the result, not the next piece of it.
For example, compare the two curl commands below. The streaming request eventually gives you the complete information, whereas the standard request does not. This is because the content loading on this particular site is triggered by JavaScript *after* the page is fully loaded, and standard mode returns the page "too soon".
```bash
curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```
> Note: `-H 'x-no-cache: true'` is used only for demonstration purposes to bypass the cache.
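Since the last chunk is the one to keep, a client can read the event stream and retain only the final payload. The sketch below parses generic `text/event-stream` framing (blank-line-separated events with `data:` lines) from any iterable of text lines. The function names are hypothetical, and the sketch treats each payload as opaque text because its exact format is not described here.

```python
from typing import Iterable, Iterator, List, Optional


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event found in `lines`."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comments are ignored here.
    if buffer:
        # Be lenient: flush a trailing event that lacks its closing blank line.
        yield "\n".join(buffer)


def last_chunk(lines: Iterable[str]) -> Optional[str]:
    """Return the payload of the final event, or None if the stream was empty."""
    last = None
    for data in iter_sse_data(lines):
        last = data
    return last
```

With the standard library, the lines could come from an open HTTP response, for example by decoding each line of the object returned by `urllib.request.urlopen` as UTF-8.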
Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave I/O and LLM processing times. This allows for quicker access and more efficient data handling:
```text
Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ...
                          |                    |                     |
                          v                    |                     |
Your LLM:                 LLM(streamContent1)  |                     |
                                               v                     |
                                               LLM(streamContent2)   |
                                                                     v
                                                                     LLM(streamContent3)
```
Note that in terms of completeness: `... > streamContent3 > streamContent2 > streamContent1`, each subsequent chunk contains more complete information.
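One way to interleave I/O and LLM processing as in the diagram is to hand each chunk to a worker as soon as it arrives. This is a minimal sketch using a thread pool. `process_as_chunks_arrive` is a hypothetical name, `llm` stands for whatever callable wraps your model, and the pool size of 4 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def process_as_chunks_arrive(
    chunks: Iterable[str], llm: Callable[[str], str]
) -> List[str]:
    """Run `llm` on every chunk as it arrives; return results in arrival order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately, so waiting for the next chunk
        # overlaps with the LLM work on the chunks already received.
        futures = [pool.submit(llm, chunk) for chunk in chunks]
        return [future.result() for future in futures]
```

Because later chunks are more complete, the last element of the returned list corresponds to the most complete content; the earlier results are useful mainly when you need a first answer quickly.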
### JSON mode
This is still very early, and the result is not really a "useful" JSON: it contains only three fields, `url`, `title` and `content`. Nonetheless, you can use the `Accept` header to control the output format:
```bash
curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
JSON mode is probably more useful in `s.jina.ai` than `r.jina.ai`. For `s.jina.ai` with JSON mode, it returns 5 results in a list, each in the structure of `{'title', 'content', 'url'}`.
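A client can normalize both shapes into one list. The sketch below assumes the response body is exactly as described above: a single object with `url`, `title` and `content` for `r.jina.ai`, or a list of such objects for `s.jina.ai`. The `Page` type and `parse_reader_json` are hypothetical names, and a real response may carry additional fields or wrapping.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    url: str
    title: str
    content: str


def parse_reader_json(body: str) -> List[Page]:
    """Parse a JSON-mode response body into a list of Page records."""
    payload = json.loads(body)
    # r.jina.ai is described as returning one object, s.jina.ai a list of them.
    items = payload if isinstance(payload, list) else [payload]
    return [
        Page(url=item["url"], title=item["title"], content=item["content"])
        for item in items
    ]
```

A missing field raises `KeyError`, which makes a change in the response shape visible immediately instead of silently producing empty records.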
### Generated alt
All images on the page that lack an `alt` attribute can be auto-captioned by a VLM (vision language model) and formatted as `!(Image [idx]: [VLM_caption])[img_URL]`. This should give your downstream text-only LLM *just enough* hints to include those images in reasoning, selection, and summarization. Use the `x-with-generated-alt` header to enable this feature:
```bash
curl -H "X-With-Generated-Alt: true" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
## How it works
[DeepWiki](https://deepwiki.com/jina-ai/reader)
## What is `thinapps-shared` submodule?
You might notice a reference to `thinapps-shared` submodule, an internal package we use to share code across our products. While it’s not open-sourced and isn't integral to the Reader's functions, it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.
That said, this is *the single codebase* behind `https://r.jina.ai`, so every time we commit here, we deploy the new version to `https://r.jina.ai`.
## Having trouble on some websites?
Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.
## License
Reader is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE).