---
pipeline_tag: text-generation
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---

<br><br>

<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>

<p align="center">
<b>Trained by <a href="https://jina.ai/">Jina AI</a>.</b>
</p>

[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-html-to-markdown-and-json) | [API](https://jina.ai/reader) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing) | [AWS](https://aws.amazon.com/marketplace/pp/prodview-jwfct4j4rvxk2?sr=0-21&ref_=beagle&applicationId=AWSMPContessa) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.reader-lm-v2-vm) | [Arxiv (soon!)]

# ReaderLM-v2

`ReaderLM-v2` is a 1.5B-parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved long-context handling. Supporting 29 languages in total, `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.

## What's New in `ReaderLM-v2`

`ReaderLM-v2` represents a significant leap forward from its predecessor, with several key improvements:

- **Better Markdown Generation**: Thanks to its new training paradigm and higher-quality training data, the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.
- **JSON Output**: Introduces direct HTML-to-JSON generation using predefined schemas, eliminating the need for intermediate markdown conversion.
- **Longer Context Handling**: Handles up to 512K tokens of combined input and output, with improved performance on long-form content.
- **Multilingual Support**: Comprehensive support across 29 languages for broader applications.
- **Enhanced Stability**: Greatly alleviates degeneration after long generations, thanks to a contrastive loss applied during training.

## Model Overview

- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: 1.54B
- **Context Window**: Up to 512K tokens (combined input and output)
- **Hidden Size**: 1536
- **Number of Layers**: 28
- **Query Heads**: 12
- **KV Heads**: 2
- **Head Size**: 128
- **Intermediate Size**: 8960
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)
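As a rough cross-check, the reported parameter count can be reproduced from the sizes above. This is a back-of-the-envelope sketch: the vocabulary size (151,936, standard for the Qwen2.5 family the model builds on) and tied input/output embeddings are assumptions not stated in this card, and biases and layer norms are ignored.

```python
# Back-of-the-envelope parameter estimate from the architecture sizes above.
# Assumptions (not stated in this card): Qwen2.5 vocabulary size of 151,936
# and tied input/output embeddings; biases and layer norms are ignored.
vocab, hidden, layers = 151_936, 1536, 28
q_heads, kv_heads, head_size, intermediate = 12, 2, 128, 8960

attn = 2 * hidden * q_heads * head_size + 2 * hidden * kv_heads * head_size  # Q/O and K/V projections
mlp = 3 * hidden * intermediate  # gate, up, and down projections
total = vocab * hidden + layers * (attn + mlp)
print(f"~{total / 1e9:.2f}B parameters")  # ~1.54B
```

This lands at roughly 1.54B, consistent with the reported parameter count.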

---

# Usage

Below, you will find instructions and examples for using `ReaderLM-v2` locally with the Hugging Face Transformers library.
For a more hands-on experience in a hosted environment, see the [Google Colab Notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).

## Via Reader API

`ReaderLM-v2` is now fully integrated with the [Reader API](https://jina.ai/reader/). To use it, specify `x-engine: readerlm-v2` in your request headers and enable response streaming with `-H 'Accept: text/event-stream'`:

```bash
curl https://r.jina.ai/https://news.ycombinator.com/ -H 'x-engine: readerlm-v2' -H 'Accept: text/event-stream'
```

You can try it without an API key at a lower rate limit. For higher rate limits, you can purchase an API key. Please note that ReaderLM-v2 requests consume 3x the normal token count from your API key allocation. This is currently an experimental feature, and we're working with the GCP team to improve GPU efficiency.

## On Google Colab

You can try `ReaderLM-v2` via our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for accelerated inference.

Note that the free T4 GPU has limitations: it doesn't support bfloat16 or flash attention 2, leading to higher memory usage and slower processing of longer inputs. Nevertheless, ReaderLM-v2 successfully processes large documents under these constraints, achieving 67 tokens/s on input and 36 tokens/s on output. For production use, we recommend an RTX 3090/4090 for optimal performance.

## Local Usage

To use `ReaderLM-v2` locally:

1. Install the necessary dependencies:

```bash
pip install transformers
```

2. Load and run the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # or "cpu"
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
```

3. (Optional) Pre-clean your HTML to remove scripts, styles, and comments, reducing the noise and length of the input:

```python
import re

# Patterns for markup that carries no useful text content
SCRIPT_PATTERN = r"<[ ]*script.*?\/[ ]*script[ ]*>"
STYLE_PATTERN = r"<[ ]*style.*?\/[ ]*style[ ]*>"
META_PATTERN = r"<[ ]*meta.*?>"
COMMENT_PATTERN = r"<[ ]*!--.*?--[ ]*>"
LINK_PATTERN = r"<[ ]*link.*?>"
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
SVG_PATTERN = r"(<svg[^>]*>)(.*?)(<\/svg>)"


def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )


def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
    html = re.sub(
        SCRIPT_PATTERN, "", html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL
    )
    html = re.sub(
        STYLE_PATTERN, "", html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL
    )
    html = re.sub(
        META_PATTERN, "", html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL
    )
    html = re.sub(
        COMMENT_PATTERN, "", html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL
    )
    html = re.sub(
        LINK_PATTERN, "", html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL
    )

    if clean_svg:
        html = replace_svg(html)
    if clean_base64:
        html = replace_base64_images(html)
    return html
```
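As a quick sanity check, the script-stripping pattern above can be exercised on its own (the sample HTML here is purely illustrative):

```python
import re

# Same SCRIPT_PATTERN as in the cleaning helper above; the sample HTML is illustrative
SCRIPT_PATTERN = r"<[ ]*script.*?\/[ ]*script[ ]*>"

html = "<html><head><script>alert(1)</script></head><body><h1>Hi</h1></body></html>"
cleaned = re.sub(SCRIPT_PATTERN, "", html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
print(cleaned)  # <html><head></head><body><h1>Hi</h1></body></html>
```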

4. Create a prompt for the model:

```python
def create_prompt(
    text: str, tokenizer, instruction: str = None, schema: str = None
) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.
    """
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."
    if schema:
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json\n{schema}\n```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```

### HTML to Markdown Example

```python
html = "<html><body><h1>Hello, world!</h1></body></html>"

html = clean_html(html)

input_prompt = create_prompt(html, tokenizer=tokenizer)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08
)

print(tokenizer.decode(outputs[0]))
```

### HTML to JSON Example

```python
schema = """
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "author": {
      "type": "string"
    },
    "date": {
      "type": "string"
    },
    "content": {
      "type": "string"
    }
  },
  "required": ["title", "author", "date", "content"]
}
"""

html = clean_html(html)
input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08
)

print(tokenizer.decode(outputs[0]))
```
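Because the model emits free-form text, it is worth validating the JSON portion of the output before using it downstream. A minimal sketch, where `raw_output` stands in for a hypothetical decoded generation:

```python
import json
import re

# Hypothetical decoded model output; in practice this comes from tokenizer.decode(...)
raw_output = (
    "```json\n"
    '{"title": "Hello, world!", "author": "unknown", "date": "unknown", "content": "Hello, world!"}\n'
    "```"
)

# Pull out the first fenced JSON block, then parse it
match = re.search(r"```json\s*(.*?)\s*```", raw_output, flags=re.DOTALL)
data = json.loads(match.group(1))

# Check the keys the schema above marks as required
missing = [k for k in ["title", "author", "date", "content"] if k not in data]
assert not missing, f"missing required fields: {missing}"
```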

## Model Performance

ReaderLM-v2 has been extensively evaluated on various tasks:

### Quantitative Evaluation

For HTML-to-Markdown tasks, the model outperforms much larger models like Qwen2.5-32B-Instruct and Gemini2-flash-expr, achieving:
- ROUGE-L: 0.84
- Levenshtein Distance: 0.22
- Jaro-Winkler Similarity: 0.82

For HTML-to-JSON tasks, it shows competitive performance with:
- F1 Score: 0.81
- Precision: 0.82
- Recall: 0.81
- Pass-Rate: 0.98
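As a consistency check, F1 is the harmonic mean of precision and recall, and the reported figures line up:

```python
# F1 = harmonic mean of the reported precision and recall
precision, recall = 0.82, 0.81
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.81
```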

### Qualitative Evaluation

The model excels in three key dimensions:
- Content Integrity: 39/50
- Structural Accuracy: 35/50
- Format Compliance: 36/50

These scores demonstrate strong performance in preserving semantic information, maintaining structural accuracy, and adhering to markdown syntax standards.

## Training Details

ReaderLM-v2 is built on Qwen2.5-1.5B-Instruct and trained using a sophisticated pipeline:

1. Data Preparation: Created the html-markdown-1m dataset of 1 million HTML documents
2. Synthetic Data Generation: Three-step pipeline using Qwen2.5-32B-Instruct
   - Drafting: Initial markdown and JSON generation
   - Refinement: Content cleanup and structure alignment
   - Critique: Quality evaluation and filtering
3. Training Process:
   - Long-context pretraining
   - Supervised fine-tuning
   - Direct preference optimization
   - Self-play reinforcement tuning