OmAlve committed
Commit dd73e30 · verified · 1 Parent(s): 4cf1cb1

Training in progress, step 1500

README.md CHANGED
@@ -1,209 +1,58 @@
  ---
- language:
- - en
- license: apache-2.0
- license_link: https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/LICENSE
- library_name: transformers
  base_model: Qwen/Qwen3-0.6B
  tags:
  - trl
  - sft
- - qwen3
- - web-extraction
- - indexlm
- - reading-steiner
- pipeline_tag: text-generation
- ---
-
- # Reading Steiner
-
- **Reading Steiner** is a **8192-token (8k) context** supervised fine-tuned (SFT) model for **index-based web content extraction**, in the spirit of [IndexLM](https://arxiv.org/abs/2512.06641). It reads a page as **numbered blocks** `[i] <tag>…</tag>` and predicts **inclusive index intervals** for either a **user query** (query-relevant) or **main body** text (main-content), as plain text like `[[2,4],[7,7]]` or `NA`.
-
- **Context length:** **8192** tokens (`max_length` in training)
- **Base model:** [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) (~596M parameters)
- **Training data:** [OmAlve/reading-steiner-data](https://huggingface.co/datasets/OmAlve/reading-steiner-data) (`messages` SFT)
- **Paper:** [An Index-based Approach for Efficient and Effective Web Content Extraction](https://arxiv.org/abs/2512.06641)
-
- ## Intended use
-
- 1. **Query-relevant (QE)** — blocks that support answering a question.
- 2. **Main-content (ME)** — blocks that are the article body vs nav/ads/sidebars.
-
- You supply **blocks**; the model does not fetch URLs or parse raw HTML trees.
-
- ---
-
- ## System prompts (training)
-
- ### Query-relevant (QE)
-
- ```
- You are Reading Steiner, a web content extraction model. Given a webpage split into indexed blocks and a user query, identify which blocks contain content relevant to the query.
-
- Each block is formatted as: [i] <tag>content</tag>
- Output the indices of relevant blocks as a Python list of [start, end] intervals (inclusive).
- If no relevant content exists, output 'NA'.
-
- Example output: [[2,4],[7,7],[10,12]]
- ```
-
- ### Main-content (ME)
-
- ```
- You are Reading Steiner, a web content extraction model. Given a webpage split into indexed blocks, identify which blocks contain the main content of the page (filtering out navigation, advertisements, sidebars, and other non-content elements).
-
- Each block is formatted as: [i] <tag>content</tag>
- Output the indices of main content blocks as a Python list of [start, end] intervals (inclusive).
- If no main content exists, output 'NA'.
-
- Example output: [[1,3],[5,8],[11,15]]
- ```
-
- ---
-
- ## User message format
-
- ### QE
-
- ```text
- URL: <string>
- Query: <question>
-
- Blocks:
- <one block per line>
-
- Output the index intervals of blocks relevant to the query.
- ```
-
- ### ME
-
- ```text
- URL: <string>
- Title: <page title>
-
- Blocks:
- <one block per line>
-
- Output the index intervals of main content blocks.
- ```
-
  ---

- ## Minimal examples (full `messages`)
-
- ### Example A — QE

- | Role | Content |
- |------|---------|
- | system | *(QE system prompt above)* |
- | user | `URL: https://example.com/article\nQuery: What substance does the article say was detected?\n\nBlocks:\n[1] <nav>Home \| Science</nav>\n[2] <h1>Water on Mars</h1>\n[3] <p>Researchers reported trace amounts of perchlorate in regolith samples.</p>\n[4] <div class="ad">Subscribe for more space news</div>\n\nOutput the index intervals of blocks relevant to the query.` |
- | assistant | `[[3,3]]` |

- ### Example B — ME
-
- | Role | Content |
- |------|---------|
- | system | *(ME system prompt above)* |
- | user | `URL: https://example.com/news\nTitle: Local river cleanup\n\nBlocks:\n[1] <nav>Home \| City \| Sports</nav>\n[2] <h1>Volunteers clear three tons of debris</h1>\n[3] <p>Organizers said turnout doubled last year's event.</p>\n[4] <p>The next cleanup is scheduled for May.</p>\n[5] <aside>Popular: Weather \| Traffic</aside>\n\nOutput the index intervals of main content blocks.` |
- | assistant | `[[2,4]]` |
-
- ### Example C — QE (no answer)
-
- | Role | Content |
- |------|---------|
- | system | *(QE system prompt above)* |
- | user | `URL: https://example.com/page\nQuery: What is the stock price of ACME Corp?\n\nBlocks:\n[1] <h1>Baking tips</h1>\n[2] <p>Preheat the oven to 350°F.</p>\n\nOutput the index intervals of blocks relevant to the query.` |
- | assistant | `NA` |
-
- ---
-
- ## Runnable inference (Transformers, 8k-capable checkpoint)
-
- Use **`enable_thinking=False`** on Qwen3 for stable interval-style completions.

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
-
- SYSTEM_QE = """You are Reading Steiner, a web content extraction model. Given a webpage split into indexed blocks and a user query, identify which blocks contain content relevant to the query.

- Each block is formatted as: [i] <tag>content</tag>
- Output the indices of relevant blocks as a Python list of [start, end] intervals (inclusive).
- If no relevant content exists, output 'NA'.
-
- Example output: [[2,4],[7,7],[10,12]]"""
-
- blocks = """[1] <nav>Home | Science</nav>
- [2] <h1>Water on Mars</h1>
- [3] <p>Researchers reported trace amounts of perchlorate in regolith samples.</p>
- [4] <div class="ad">Subscribe for more space news</div>"""
-
- user = f"""URL: https://example.com/article
- Query: What substance does the article say was detected?
-
- Blocks:
- {blocks}
-
- Output the index intervals of blocks relevant to the query."""
-
- messages = [
- {"role": "system", "content": SYSTEM_QE},
- {"role": "user", "content": user},
- ]
-
- model_id = "OmAlve/reading-steiner"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
- model_id, torch_dtype=torch.bfloat16, device_map="auto"
- )
-
- inputs = tokenizer.apply_chat_template(
- messages,
- tokenize=True,
- add_generation_prompt=True,
- return_tensors="pt",
- enable_thinking=False,
- ).to(model.device)
-
- out = model.generate(inputs, max_new_tokens=128, do_sample=False)
- print(tokenizer.decode(out[0][inputs.shape[-1] :], skip_special_tokens=True))
  ```

- **ME variant** — same flow; replace system with the **ME** block above and user text with the **ME** template (`URL`, `Title`, `Blocks`, main-content closing line).

- ---

- ## Training summary

- | Setting | Value |
- |--------|--------|
- | **Max sequence length** | **8192 (8k)** |
- | Objective | Causal LM SFT (`messages`) |
- | Learning rate | 2e-5, cosine, warmup 5% |
- | Epochs | 3 |
- | Precision | bf16, gradient checkpointing |
- | Eval / save | Every 500 steps; best by `eval_loss` |

- ## Limitations

- Small 0.6B model — validate intervals; match training **system** + **user** layout for best behavior. Derivative of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B); follow its license.

  ## Citations

- ```bibtex
- @article{indexlm2025,
- title={An Index-based Approach for Efficient and Effective Web Content Extraction},
- journal={arXiv preprint arXiv:2512.06641},
- year={2025},
- url={https://arxiv.org/abs/2512.06641}
- }
- ```

  ```bibtex
  @misc{vonwerra2022trl,
- title={{TRL: Transformer Reinforcement Learning}},
- author={Leandro von Werra and others},
- howpublished={\url{https://github.com/huggingface/trl}},
- year={2022}
  }
- ```
 
  ---
  base_model: Qwen/Qwen3-0.6B
+ library_name: transformers
+ model_name: reading-steiner
  tags:
+ - generated_from_trainer
  - trl
  - sft
+ licence: license
  ---

+ # Model Card for reading-steiner

+ This model is a fine-tuned version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B).
+ It has been trained using [TRL](https://github.com/huggingface/trl).

+ ## Quick start

  ```python
+ from transformers import pipeline

+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="OmAlve/reading-steiner", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
  ```

+ ## Training procedure

+


+ This model was trained with SFT.

+ ### Framework versions

+ - TRL: 0.24.0
+ - Transformers: 5.5.0
+ - Pytorch: 2.5.1+cu124
+ - Datasets: 4.3.0
+ - Tokenizers: 0.22.2

  ## Citations

+
+ Cite TRL as:
+
  ```bibtex
  @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
  }
+ ```
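
The README removed by this commit described the model's output as inclusive index intervals such as `[[2,4],[7,7]]`, or `NA` when nothing matches. The sketch below shows one way a caller might parse that text and pick out the matching blocks. The helper names `parse_intervals` and `select_blocks` are hypothetical and not part of the repository, and the 1-based block indexing is an assumption drawn from the removed examples.

```python
import ast


def parse_intervals(text):
    """Parse output like '[[2,4],[7,7]]' or 'NA' into a list of (start, end) tuples."""
    text = text.strip()
    # The removed system prompt asks for 'NA' when nothing is relevant.
    if text.strip("'\"").upper() == "NA":
        return []
    try:
        value = ast.literal_eval(text)  # safe literal parsing, no code execution
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"unparseable interval output: {text!r}") from exc
    if not isinstance(value, list):
        raise ValueError(f"expected a list of intervals, got: {text!r}")
    intervals = []
    for item in value:
        ok = (
            isinstance(item, (list, tuple))
            and len(item) == 2
            and all(isinstance(i, int) and not isinstance(i, bool) for i in item)
            and item[0] <= item[1]
        )
        if not ok:
            raise ValueError(f"malformed interval: {item!r}")
        intervals.append((item[0], item[1]))
    return intervals


def select_blocks(blocks, intervals):
    """Return blocks whose index (assumed 1-based) lies in any inclusive interval."""
    keep = set()
    for start, end in intervals:
        keep.update(range(start, end + 1))
    return [block for i, block in enumerate(blocks, start=1) if i in keep]
```

Since the removed Limitations section recommends validating intervals from a 0.6B model, raising `ValueError` on malformed output (rather than silently returning nothing) keeps bad generations visible to the caller.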
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:180fcda07b8a0f440d74d4a3fe33cd648aeb70b70594d2a0e3a577ec65c388fe
  size 1192135096

  version https://git-lfs.github.com/spec/v1
+ oid sha256:2db0901acf2ac1095e55f08a3d3d467aa4d41eafe77e386841c1d84ded32576b
  size 1192135096
runs/Apr24_08-06-45_1eb67182ed08/events.out.tfevents.1777018005.1eb67182ed08.53075.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f224409338f39b26c93a7665d42639c290999f162e5cd8df9e7ffb3e201a6b41
- size 45007

  version https://git-lfs.github.com/spec/v1
+ oid sha256:acbdff9c3a12aa0ca1143ae6020e518d4b588e6212f95ca1c74192a762316dba
+ size 64546
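
The `model.safetensors` and tfevents entries above are Git LFS pointer files: three `key value` lines giving the spec version, a SHA-256 object ID, and the size in bytes. As a rough sketch, a downloaded file could be checked against such a pointer as follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and only the three fields shown in this diff are handled.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(chunks, pointer):
    """Check an iterable of byte chunks against the pointer's size and SHA-256 digest."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    hasher = hashlib.sha256()
    size = 0
    for chunk in chunks:
        hasher.update(chunk)
        size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


# New model.safetensors pointer from this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:2db0901acf2ac1095e55f08a3d3d467aa4d41eafe77e386841c1d84ded32576b
size 1192135096
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
```

Taking chunks rather than a single bytes object lets a roughly 1.2 GB checkpoint be hashed without loading it into memory, for example by passing `iter(lambda: f.read(1 << 20), b"")` for a file `f` opened in binary mode.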