OmAlve committed on
Commit
e8c528d
·
verified ·
1 Parent(s): b13b005

Copy HANDOFF.md from IndexLM-0.6B

  1. HANDOFF.md +1082 -0
HANDOFF.md ADDED
@@ -0,0 +1,1082 @@
+ # IndexLM-0.6B: Index-based Web Content Extraction
+
+ ## Project Handoff Document
+
+ **Paper**: [An Index-based Approach for Efficient and Effective Web Content Extraction](https://arxiv.org/abs/2512.06641)
+ **Goal**: Fine-tune a SOTA web content extraction model that runs fast on CPU
+ **Status**: Dataset prepared & pushed ✅ | Training script ready ✅ | Training NOT yet run ❌
+
+ ---
+
+ ## 1. What This Is
+
+ The paper introduces **IndexLM**, a model that extracts relevant content from web pages by predicting **index intervals** instead of generating full text. According to the paper, this makes it:
+ - **10–50× faster** than generative extraction (ReaderLM-v2, Firecrawl, etc.)
+ - **SOTA on RAG QA** benchmarks (HotpotQA, NQ, TriviaQA, MuSiQue, MultiHopRAG)
+ - **Tiny**: even the 0.6B version beats every baseline in the table below on average RAG QA F1 and ME F1
+
+ The original IndexLM weights are **not publicly released**. This project replicates the approach.
+
+ ### How It Works
+
+ 1. HTML is cleaned and split into indexed blocks: `[1] <h1>Title</h1>`, `[2] <p>Content...</p>`, etc.
+ 2. The model receives these blocks plus a query.
+ 3. It outputs index intervals such as `[[2,4],[7,7],[10,12]]`, identifying which blocks are relevant.
+ 4. The selected blocks are reassembled into clean HTML/Markdown.
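
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `expand_intervals` and `reassemble` are hypothetical helper names, and the page is assumed to be a plain list of HTML strings where the first element is block `[1]`.

```python
import json


def expand_intervals(intervals):
    """Expand inclusive [start, end] pairs into a sorted list of block indices."""
    indices = set()
    for start, end in intervals:
        indices.update(range(start, end + 1))
    return sorted(indices)


def reassemble(blocks, model_output):
    """Return the HTML of the selected blocks, joined in page order.

    blocks: list of HTML strings; blocks[0] is block [1] on the page.
    model_output: the raw interval string, e.g. "[[2,4],[7,7]]", or "NA".
    """
    if model_output.strip().upper() == "NA":
        return ""
    indices = expand_intervals(json.loads(model_output))
    # Ignore indices that fall outside the page instead of raising.
    return "\n".join(blocks[i - 1] for i in indices if 1 <= i <= len(blocks))


page = [
    "<nav>Home</nav>",
    "<h1>Python Programming</h1>",
    "<p>Python is a programming language...</p>",
    "<footer>Copyright 2024</footer>",
]
print(reassemble(page, "[[2,3]]"))
```

Because the model only emits a handful of index tokens, the cost of this post-processing step is negligible next to generation.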
+
+ Two tasks:
+ - **Query-relevant extraction (QE)**: Extract blocks relevant to a specific query
+ - **Main content extraction (ME)**: Extract main content, filtering out nav/ads/sidebars
+
+ ### Paper Results (Tables 2 & 3)
+
+ | Model | Params | Avg RAG QA F1 | ME F1 | QE F1 | Latency (ME) |
+ |-------|--------|---------------|-------|-------|-------------|
+ | **IndexLM-0.6B** | 0.6B | 54.70 | 83.38 | 28.64 | **0.35s** |
+ | **IndexLM-4B** | 4B | 55.41 | 87.40 | 31.69 | 0.81s |
+ | ReaderLM-v2 | 1.5B | 46.84 | 68.89 | 13.31 | 11.76s |
+ | HtmlRAG | - | 47.00 | 48.65 | 8.83 | 7.12s |
+ | Firecrawl Extract | API | 52.72 | - | 29.48 | 11.33s |
+
+ ---
+
+ ## 2. What's Been Done
+
+ ### ✅ Dataset Created & Pushed (v2, multi-domain)
+
+ **Hub**: [`OmAlve/indexlm-training-data`](https://huggingface.co/datasets/OmAlve/indexlm-training-data)
+
+ | Split | Rows |
+ |-------|------|
+ | train | 21,098 |
+ | eval | 500 |
+
+ **Domain Composition (avoids Wikipedia-only bias):**
+
+ | Source | Count | % | Domain |
+ |--------|-------|---|--------|
+ | MultiHopRAG | 7,165 | 33.2% | News (Mashable, CNBC, AP, etc.) |
+ | HotpotQA | 6,479 | 30.0% | Wikipedia |
+ | HtmlRAG-train | 2,692 | 12.5% | **Real Bing-scraped web HTML** (diverse) |
+ | MS MARCO | 4,844 | 22.4% | Diverse web (Bing search results) |
+ | NA (mismatched) | 418 | 1.9% | Cross-domain |
+
+ **Task Type Composition:**
+ - `query_relevant` (~78%): query-specific extraction
+ - `main_content` (~20%): main content vs. noise (nav/ads/cookies)
+ - `query_relevant_na` (~2%): no relevant content exists
+
+ **Key improvement over v1**: v2 adds real web HTML from Bing search results (via HtmlRAG-train), news articles, and diverse MS MARCO web QA instead of relying on Wikipedia alone.
+
+ **Format**: Conversational `messages` column (SFTTrainer-native):
+ ```json
+ {
+   "messages": [
+     {"role": "system", "content": "You are IndexLM, a web content extraction model..."},
+     {"role": "user", "content": "URL: ...\nQuery: ...\n\nBlocks:\n[1] <h2>Title</h2>\n[2] <p>Content</p>\n...\n\nOutput the index intervals of blocks relevant to the query."},
+     {"role": "assistant", "content": "[[2, 4], [7, 7]]"}
+   ]
+ }
+ ```
+
+ **Token length stats** (Qwen3-0.6B tokenizer):
+ - Min: 316, Max: 4,105, Mean: 1,944, Median: 2,019
+ - 43 examples filtered out (>4096 tokens)
+
+ **Data pipeline** (from `prepare_data_v2.py`):
+ 1. **HtmlRAG-train** (5,880 raw examples): Real Bing-scraped HTML from 5 QA datasets (NQ, ASQA, TriviaQA, MuSiQue, HotpotQA). Segments HTML by block-level tags, then matches relevant blocks to ground-truth answers using trigram/substring matching.
+ 2. **MultiHopRAG** (8,521 examples): News articles from Mashable, CNBC, AP, etc. Converts article body + evidence annotations to indexed blocks. Injects realistic noise blocks.
+ 3. **HotpotQA** (6,486 examples, minority): Wikipedia context with supporting facts → index intervals. Noise injected.
+ 4. **MS MARCO** (4,844 examples): Diverse web QA from Bing search. Passages from real web pages across numeric, entity, description, location, and person query types.
+ 5. **NA examples** (500): Mismatched query-page pairs from different sources.
+ 6. Filters to ≤4096 tokens, shuffles, and splits into train/eval.
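
The block labeling in step 1 can be pictured with the sketch below. The actual logic lives in `prepare_data_v2.py`, which is not shown here, so the word-level trigrams, the 0.5 overlap threshold, and all helper names are assumptions made purely for illustration.

```python
import re


def word_trigrams(text):
    """Return the set of lowercase word trigrams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def block_matches_answer(block_text, answer, threshold=0.5):
    """True if the answer appears in the block verbatim or by trigram overlap."""
    if answer.lower() in block_text.lower():  # substring match
        return True
    answer_grams = word_trigrams(answer)
    if not answer_grams:  # answers under three words can only match as substrings
        return False
    overlap = len(answer_grams & word_trigrams(block_text)) / len(answer_grams)
    return overlap >= threshold


def label_blocks(blocks, answer):
    """Return 1-based indices of blocks judged relevant to the answer."""
    return [i for i, block in enumerate(blocks, start=1)
            if block_matches_answer(block, answer)]


blocks = [
    "<nav>Home | About</nav>",
    "<p>The Eiffel Tower was completed in 1889 for the World's Fair.</p>",
    "<p>It is located in Paris, France.</p>",
]
print(label_blocks(blocks, "Paris"))
print(label_blocks(blocks, "The Eiffel Tower was completed in the year 1889"))
```

The resulting indices would then be collapsed into intervals to form the assistant target. A looser threshold yields more positive blocks at the cost of noisier labels.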
+
+ ### ✅ Training Script Ready
+
+ **File**: `train_indexlm.py` (see Section 5.2)
+
+ Key settings:
+ - **Base model**: `Qwen/Qwen3-0.6B` (751M params, bf16, GQA, 32K context)
+ - **Method**: SFT via TRL `SFTTrainer` + `SFTConfig`
+ - **Output**: `OmAlve/IndexLM-0.6B` on Hub
+ - **Hyperparameters**: lr=2e-5, epochs=3, batch=4, grad_accum=4 (effective BS=16), max_length=4096, cosine LR schedule, warmup=5%
+ - `push_to_hub=True`, `hub_model_id="OmAlve/IndexLM-0.6B"`
+ - Trackio monitoring included
+ - Flash Attention 2 for training speed
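
For planning checkpoints and timeouts, the hyperparameters above translate into an approximate step count. The calculation below is a rough sketch that assumes a single GPU and rounds up partial batches; the Trainer's own rounding may differ by a step or two.

```python
import math

train_rows = 21_098        # train split size from the dataset table
per_device_batch = 4
grad_accum = 4
epochs = 3
warmup_ratio = 0.05
num_gpus = 1               # assumption: single-GPU run

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
```

With evaluation and saving every 500 steps, a run of this length would produce roughly seven evaluation points.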
+
+ ### ✅ Evaluation Script Ready
+
+ **File**: `eval_indexlm.py` (see Section 5.3)
+
+ Evaluates:
+ - QE F1/Precision/Recall on eval split
+ - ME F1/Precision/Recall on eval split
+ - CPU inference speed benchmark
+
+ ### ❌ Training Not Yet Run
+
+ Training was blocked by a credits error on HF Jobs (HTTP 402 Payment Required). `train_indexlm.py` still needs to be run on a GPU.
+
+ ---
+
+ ## 3. How to Train
+
+ ### Option A: HF Jobs (if you have credits)
+
+ ```bash
+ # Dependencies
+ pip install "transformers>=4.51.0" "trl>=1.2.0" torch datasets accelerate trackio
+ pip install flash-attn --no-build-isolation
+ ```
+
+ Recommended hardware: **a10g-large** ($2/hr) or **t4-small** ($0.60/hr); the model is only 0.6B params.
+ Estimated time: **2-4 hours** on a10g, **4-6 hours** on T4.
+ Set timeout to **6h** minimum.
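
As a quick budget check, the quoted hourly prices and runtime estimates imply the cost ranges computed below. The figures are copied from the text above and may be out of date; `cost_range` is just an illustrative helper.

```python
# Hourly prices and runtime ranges as quoted above (they may be out of date).
options = {
    "a10g-large": {"usd_per_hour": 2.00, "hours": (2, 4)},
    "t4-small": {"usd_per_hour": 0.60, "hours": (4, 6)},
}


def cost_range(option):
    """Return (low, high) cost in USD for one hardware option."""
    low_hours, high_hours = option["hours"]
    rate = option["usd_per_hour"]
    return round(low_hours * rate, 2), round(high_hours * rate, 2)


for name, option in options.items():
    low, high = cost_range(option)
    print(f"{name}: ${low:.2f} to ${high:.2f}")
```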
+
+ ### Option B: Any GPU machine
+
+ ```bash
+ pip install "transformers>=4.51.0" "trl>=1.2.0" torch datasets accelerate trackio
+ pip install flash-attn --no-build-isolation # optional, speeds up training
+
+ python train_indexlm.py
+ ```
+
+ **VRAM**: ~8-10 GB with gradient checkpointing + bf16 at batch_size=4. Fits on T4 (16GB), any A-series, etc. Note that a T4 (Turing architecture) supports neither Flash Attention 2 nor native bf16, so on that card the script would likely need `attn_implementation="sdpa"` and fp16 instead of bf16.
+
+ ### Option C: Without Flash Attention
+
+ If `flash-attn` fails to install, or the GPU is older than Ampere, change this line in `train_indexlm.py`:
+ ```python
+ # FROM:
+ attn_implementation="flash_attention_2",
+ # TO:
+ attn_implementation="sdpa",
+ ```
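
Instead of editing the script by hand, the choice could be made at runtime. The sketch below is one possible approach, not part of `train_indexlm.py`: `pick_attn_implementation` is a hypothetical helper that only checks whether the `flash_attn` package is importable. It does not verify that the GPU itself supports Flash Attention 2.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", fallback="sdpa"):
    """Return `preferred` unless it needs flash_attn and the package is missing."""
    needs_flash = preferred == "flash_attention_2"
    if needs_flash and importlib.util.find_spec("flash_attn") is None:
        return fallback
    return preferred


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string would then be passed as `attn_implementation=...` to `AutoModelForCausalLM.from_pretrained`.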
+
+ ---
+
+ ## 4. How to Deploy on CPU
+
+ After training, the model at `OmAlve/IndexLM-0.6B` can be loaded for CPU inference:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "OmAlve/IndexLM-0.6B",
+     torch_dtype=torch.float32,
+     attn_implementation="sdpa",
+ )
+ tokenizer = AutoTokenizer.from_pretrained("OmAlve/IndexLM-0.6B")
+ model.eval()
+
+ # Example: extract relevant content from a web page
+ messages = [
+     {"role": "system", "content": "You are IndexLM, a web content extraction model..."},
+     {"role": "user", "content": "URL: ...\nQuery: What is Python?\n\nBlocks:\n[1] <nav>Home</nav>\n[2] <h1>Python Programming</h1>\n[3] <p>Python is a programming language...</p>\n[4] <footer>Copyright 2024</footer>\n\nOutput the index intervals of blocks relevant to the query."}
+ ]
+
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
+ inputs = tokenizer(text, return_tensors="pt")
+
+ with torch.no_grad():
+     out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+
+ # Decode only the newly generated tokens (everything after the prompt)
+ response = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
+ print(response)  # e.g. [[2, 3]] once the model is trained
+ ```
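
The raw `response` string still has to be turned into block indices, and a small model may occasionally emit stray text or out-of-range numbers. The sketch below shows one defensive way to handle that. `parse_model_output` is a hypothetical helper, and the clamping and skip-on-error behavior are design assumptions rather than anything specified by the paper.

```python
import json
import re


def parse_model_output(response, num_blocks):
    """Turn a raw model response into a sorted list of valid 1-based block indices.

    Returns [] for "NA", unparseable text, or intervals entirely outside the page.
    """
    response = response.strip()
    if not response or response.upper() == "NA":
        return []
    # Grab the outermost bracketed span in case stray text surrounds it.
    match = re.search(r"\[.*\]", response, flags=re.DOTALL)
    if match is None:
        return []
    try:
        intervals = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    indices = set()
    for interval in intervals:
        # Skip anything that is not a [start, end] pair of integers.
        if not (isinstance(interval, list) and len(interval) == 2
                and all(isinstance(v, int) for v in interval)):
            continue
        start, end = interval
        # Clamp to the page so an out-of-range index cannot select a missing block.
        indices.update(range(max(start, 1), min(end, num_blocks) + 1))
    return sorted(indices)


print(parse_model_output("[[2, 3]]", num_blocks=4))
print(parse_model_output("Answer: [[1,2],[4,9]]", num_blocks=4))
```

Returning an empty list on failure keeps a downstream RAG pipeline running; logging the raw response in that case would make such failures easier to spot.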
+
+ **For even faster CPU inference**: quantize to INT4/INT8 or export to ONNX. `bitsandbytes` is one option, but it primarily targets CUDA GPUs, so check its CPU support before relying on it.
+
+ ---
+
+ ## 5. All Scripts
+
+ ### 5.1 Data Preparation (`prepare_data.py`)
+
+ ```python
+ """
+ Prepare IndexLM training data from HotpotQA.
+
+ Pipeline:
+ 1. Load HotpotQA (has context = list of (title, sentences) + supporting_facts)
+ 2. Convert context into indexed HTML-like blocks: [i] <tag>content</tag>
+ 3. The target is index intervals of blocks containing supporting facts
+ 4. Also create main-content extraction examples (all content blocks are "main content",
+    but we inject noise blocks like nav/ads to train the model to filter them)
+ 5. Format as conversational messages for SFT
+ """
+
+ import json
+ import random
+ import re
+ from datasets import load_dataset, Dataset
+ from collections import defaultdict
+
+ random.seed(42)
+
218
+ # Noise blocks to inject (simulating real web page clutter)
219
+ NOISE_BLOCKS = [
220
+ '<nav>Home | About | Contact | Privacy Policy</nav>',
221
+ '<div class="ad">Advertisement - Continue Reading Below</div>',
222
+ '<div class="sidebar">Related Articles: Top 10 Facts You Didn\'t Know</div>',
223
+ '<footer>© 2024 All Rights Reserved | Terms of Service</footer>',
224
+ '<div class="cookie-banner">This site uses cookies. Accept | Decline</div>',
225
+ '<div class="social">Share on: Twitter | Facebook | LinkedIn</div>',
226
+ '<nav class="breadcrumb">Home > Category > Subcategory > Article</nav>',
227
+ '<div class="newsletter">Subscribe to our newsletter for updates</div>',
228
+ '<div class="popup">Sign up for free access to premium content</div>',
229
+ '<aside>Trending: Latest news and popular stories</aside>',
230
+ '<div class="comments">Comments (0) - Be the first to comment</div>',
231
+ '<div class="author">Written by Staff Reporter | Updated: Jan 2024</div>',
232
+ '<div class="pagination">Previous | 1 | 2 | 3 | Next</div>',
233
+ '<div class="search">Search this site...</div>',
234
+ '<div class="menu">Categories: Science, Tech, Health, Sports</div>',
235
+ ]
236
+
237
+ SYSTEM_PROMPT_QE = """You are IndexLM, a web content extraction model. Given a webpage split into indexed blocks and a user query, identify which blocks contain content relevant to the query.
238
+
239
+ Each block is formatted as: [i] <tag>content</tag>
240
+ Output the indices of relevant blocks as a Python list of [start, end] intervals (inclusive).
241
+ If no relevant content exists, output 'NA'.
242
+
243
+ Example output: [[2,4],[7,7],[10,12]]"""
244
+
245
+ SYSTEM_PROMPT_ME = """You are IndexLM, a web content extraction model. Given a webpage split into indexed blocks, identify which blocks contain the main content of the page (filtering out navigation, advertisements, sidebars, and other non-content elements).
246
+
247
+ Each block is formatted as: [i] <tag>content</tag>
248
+ Output the indices of main content blocks as a Python list of [start, end] intervals (inclusive).
249
+ If no main content exists, output 'NA'.
250
+
251
+ Example output: [[1,3],[5,8],[11,15]]"""
252
+
253
+
+ def indices_to_intervals(indices):
+     """Convert a list of indices to a JSON string of intervals [[start, end], ...]."""
+     if not indices:
+         return "NA"
+     indices = sorted(set(indices))
+     intervals = []
+     start = indices[0]
+     end = indices[0]
+     for i in indices[1:]:
+         if i == end + 1:
+             # Consecutive index: extend the current interval
+             end = i
+         else:
+             # Gap found: close the current interval and start a new one
+             intervals.append([start, end])
+             start = i
+             end = i
+     intervals.append([start, end])
+     return json.dumps(intervals)
271
+
272
+
273
+ def create_indexed_blocks_from_hotpotqa(context, supporting_facts, inject_noise=True):
274
+ """
275
+ Convert HotpotQA context into indexed HTML blocks.
276
+
277
+ context: {'title': [...], 'sentences': [[...], ...]}
278
+ supporting_facts: {'title': [...], 'sent_id': [...]}
279
+
280
+ Returns: (block_text, relevant_indices, all_content_indices)
281
+ """
282
+ titles = context['title']
283
+ sentences_list = context['sentences']
284
+
285
+ # Build supporting facts lookup
286
+ sf_lookup = defaultdict(set)
287
+ for title, sent_id in zip(supporting_facts['title'], supporting_facts['sent_id']):
288
+ sf_lookup[title].add(sent_id)
289
+
290
+ blocks = []
291
+ relevant_indices = []
292
+ content_indices = [] # All real content (non-noise)
293
+
294
+ idx = 1
295
+
296
+ for doc_idx, (title, sentences) in enumerate(zip(titles, sentences_list)):
297
+ # Title block
298
+ blocks.append(f"[{idx}] <h2>{title}</h2>")
299
+ content_indices.append(idx)
300
+ if title in sf_lookup:
301
+ # Title of a supporting document is relevant
302
+ relevant_indices.append(idx)
303
+ idx += 1
304
+
305
+ # Sentence blocks
306
+ for sent_idx, sentence in enumerate(sentences):
307
+ sentence = sentence.strip()
308
+ if not sentence:
309
+ continue
310
+
311
+ # Use <p> for regular text
312
+ blocks.append(f"[{idx}] <p>{sentence}</p>")
313
+ content_indices.append(idx)
314
+
315
+ if title in sf_lookup and sent_idx in sf_lookup[title]:
316
+ relevant_indices.append(idx)
317
+ idx += 1
318
+
319
+ # Inject noise between documents sometimes
320
+ if inject_noise and random.random() < 0.4 and doc_idx < len(titles) - 1:
321
+ noise = random.choice(NOISE_BLOCKS)
322
+ blocks.append(f"[{idx}] {noise}")
323
+ idx += 1
324
+
325
+ # Sometimes add noise at start and end
326
+ if inject_noise:
327
+ prefix_noise = []
328
+ if random.random() < 0.5:
329
+ for _ in range(random.randint(1, 3)):
330
+ noise = random.choice(NOISE_BLOCKS)
331
+ prefix_noise.append(noise)
332
+
333
+ suffix_noise = []
334
+ if random.random() < 0.5:
335
+ for _ in range(random.randint(1, 3)):
336
+ noise = random.choice(NOISE_BLOCKS)
337
+ suffix_noise.append(noise)
338
+
339
+ if prefix_noise or suffix_noise:
340
+ # Reindex everything
341
+ new_blocks = []
342
+ new_relevant = []
343
+ new_content = []
344
+ new_idx = 1
345
+
346
+ # Prefix noise
347
+ for noise in prefix_noise:
348
+ new_blocks.append(f"[{new_idx}] {noise}")
349
+ new_idx += 1
350
+
351
+ # Remap original blocks
352
+ offset = len(prefix_noise)
353
+ for b in blocks:
354
+ old_idx = int(b.split(']')[0].replace('[', ''))
355
+ new_b = f"[{old_idx + offset}] " + '] '.join(b.split('] ')[1:])
356
+ new_blocks.append(new_b)
357
+
358
+ new_relevant = [r + offset for r in relevant_indices]
359
+ new_content = [c + offset for c in content_indices]
360
+
361
+ # Suffix noise
362
+ next_idx = len(new_blocks) + 1
363
+ for noise in suffix_noise:
364
+ new_blocks.append(f"[{next_idx}] {noise}")
365
+ next_idx += 1
366
+
367
+ blocks = new_blocks
368
+ relevant_indices = new_relevant
369
+ content_indices = new_content
370
+
371
+ block_text = "\n".join(blocks)
372
+ return block_text, relevant_indices, content_indices
373
+
374
+
375
+ def build_query_relevant_example(question, block_text, relevant_indices, url="https://en.wikipedia.org"):
376
+ """Build a query-relevant extraction (QE) example."""
377
+ intervals = indices_to_intervals(relevant_indices)
378
+
379
+ user_content = f"URL: {url}\nQuery: {question}\n\nBlocks:\n{block_text}\n\nOutput the index intervals of blocks relevant to the query."
380
+
381
+ messages = [
382
+ {"role": "system", "content": SYSTEM_PROMPT_QE},
383
+ {"role": "user", "content": user_content},
384
+ {"role": "assistant", "content": intervals}
385
+ ]
386
+ return messages
387
+
388
+
389
+ def build_main_content_example(block_text, content_indices, title="Wikipedia Article", url="https://en.wikipedia.org"):
390
+ """Build a main content extraction (ME) example."""
391
+ intervals = indices_to_intervals(content_indices)
392
+
393
+ user_content = f"URL: {url}\nTitle: {title}\n\nBlocks:\n{block_text}\n\nOutput the index intervals of main content blocks."
394
+
395
+ messages = [
396
+ {"role": "system", "content": SYSTEM_PROMPT_ME},
397
+ {"role": "user", "content": user_content},
398
+ {"role": "assistant", "content": intervals}
399
+ ]
400
+ return messages
401
+
402
+
403
+ def process_hotpotqa():
404
+ """Process HotpotQA into IndexLM training data."""
405
+ print("Loading HotpotQA...")
406
+ ds = load_dataset("hotpotqa/hotpot_qa", "distractor", split="train")
407
+
408
+ # Sample a manageable amount
409
+ num_samples = min(15000, len(ds))
410
+ ds = ds.shuffle(seed=42).select(range(num_samples))
411
+
412
+ all_examples = []
413
+ skipped = 0
414
+
415
+ for i, row in enumerate(ds):
416
+ if i % 1000 == 0:
417
+ print(f"Processing {i}/{num_samples}...")
418
+
419
+ try:
420
+ block_text, relevant_indices, content_indices = create_indexed_blocks_from_hotpotqa(
421
+ row['context'], row['supporting_facts'], inject_noise=True
422
+ )
423
+
424
+ # Skip if too few relevant indices
425
+ if len(relevant_indices) < 1:
426
+ skipped += 1
427
+ continue
428
+
429
+ # Query-relevant extraction example
430
+ qe_messages = build_query_relevant_example(
431
+ row['question'], block_text, relevant_indices
432
+ )
433
+ all_examples.append({
434
+ "messages": qe_messages,
435
+ "task_type": "query_relevant",
436
+ "source": "hotpotqa"
437
+ })
438
+
439
+ # Main content extraction example (50% of the time)
440
+ if random.random() < 0.5:
441
+ me_messages = build_main_content_example(
442
+ block_text, content_indices,
443
+ title=row['context']['title'][0] if row['context']['title'] else "Article"
444
+ )
445
+ all_examples.append({
446
+ "messages": me_messages,
447
+ "task_type": "main_content",
448
+ "source": "hotpotqa"
449
+ })
450
+ except Exception as e:
451
+ skipped += 1
452
+ if skipped < 5:
453
+ print(f"Error on row {i}: {e}")
454
+ continue
455
+
456
+ print(f"Created {len(all_examples)} examples from HotpotQA ({skipped} skipped)")
457
+ return all_examples
458
+
459
+
460
+ def create_synthetic_web_pages():
461
+ """Create synthetic web page examples for main content extraction training."""
462
+ print("Creating synthetic web page examples...")
463
+
464
+ # Load a text dataset to get content
465
+ ds = load_dataset("hotpotqa/hotpot_qa", "distractor", split="validation")
466
+ ds = ds.shuffle(seed=123).select(range(3000))
467
+
468
+ examples = []
469
+
470
+ for i, row in enumerate(ds):
471
+ if i % 500 == 0:
472
+ print(f"Synthetic page {i}/3000...")
473
+
474
+ try:
475
+ # Build a more realistic web page structure
476
+ titles = row['context']['title']
477
+ sentences_list = row['context']['sentences']
478
+
479
+ if not titles or not sentences_list:
480
+ continue
481
+
482
+ blocks = []
483
+ content_indices = []
484
+ idx = 1
485
+
486
+ # Header noise (nav, etc.)
487
+ num_header_noise = random.randint(1, 4)
488
+ for _ in range(num_header_noise):
489
+ blocks.append(f"[{idx}] {random.choice(NOISE_BLOCKS)}")
490
+ idx += 1
491
+
492
+ # Page title
493
+ main_title = titles[0]
494
+ blocks.append(f"[{idx}] <h1>{main_title}</h1>")
495
+ content_indices.append(idx)
496
+ idx += 1
497
+
498
+ # Main content (just first 1-3 documents)
499
+ num_docs = min(random.randint(1, 3), len(titles))
500
+ for doc_idx in range(num_docs):
501
+ title = titles[doc_idx]
502
+ sents = sentences_list[doc_idx]
503
+
504
+ if doc_idx > 0:
505
+ blocks.append(f"[{idx}] <h2>{title}</h2>")
506
+ content_indices.append(idx)
507
+ idx += 1
508
+
509
+ for sent in sents:
510
+ sent = sent.strip()
511
+ if not sent:
512
+ continue
513
+ blocks.append(f"[{idx}] <p>{sent}</p>")
514
+ content_indices.append(idx)
515
+ idx += 1
516
+
517
+ # Occasional inline noise
518
+ if random.random() < 0.3:
519
+ blocks.append(f"[{idx}] {random.choice(NOISE_BLOCKS)}")
520
+ idx += 1
521
+
522
+ # Footer noise
523
+ num_footer_noise = random.randint(1, 4)
524
+ for _ in range(num_footer_noise):
525
+ blocks.append(f"[{idx}] {random.choice(NOISE_BLOCKS)}")
526
+ idx += 1
527
+
528
+ block_text = "\n".join(blocks)
529
+ me_messages = build_main_content_example(
530
+ block_text, content_indices,
531
+ title=main_title,
532
+ url=f"https://en.wikipedia.org/wiki/{main_title.replace(' ', '_')}"
533
+ )
534
+ examples.append({
535
+ "messages": me_messages,
536
+ "task_type": "main_content",
537
+ "source": "synthetic"
538
+ })
539
+ except Exception as e:
540
+ continue
541
+
542
+ print(f"Created {len(examples)} synthetic web page examples")
543
+ return examples
544
+
545
+
+ def create_na_examples():
+     """Create examples where no relevant content exists (model should output 'NA')."""
+     print("Creating NA examples...")
+     ds = load_dataset("hotpotqa/hotpot_qa", "distractor", split="validation")
+     ds = ds.shuffle(seed=456).select(range(1000))
+
+     examples = []
+
+     for i, row in enumerate(ds):
+         try:
+             # Use context from one question but query from another (mismatched)
+             other_idx = (i + 500) % len(ds)
+             other_question = ds[other_idx]['question']
+
+             # Build blocks from the current context; no supporting facts are passed,
+             # so no block is marked as relevant
+             block_text, _, content_indices = create_indexed_blocks_from_hotpotqa(
+                 row['context'], {'title': [], 'sent_id': []}, inject_noise=True
+             )
+
+             user_content = f"URL: https://en.wikipedia.org\nQuery: {other_question}\n\nBlocks:\n{block_text}\n\nOutput the index intervals of blocks relevant to the query."
+
+             messages = [
+                 {"role": "system", "content": SYSTEM_PROMPT_QE},
+                 {"role": "user", "content": user_content},
+                 {"role": "assistant", "content": "NA"}
+             ]
+             examples.append({
+                 "messages": messages,
+                 "task_type": "query_relevant_na",
+                 "source": "hotpotqa_mismatched"
+             })
+         except Exception:
+             # Skip malformed rows, but let KeyboardInterrupt/SystemExit propagate
+             continue
+
+     # Keep only a fraction (the paper mentions partial filtering of NA)
+     random.shuffle(examples)
+     examples = examples[:300]
+     print(f"Created {len(examples)} NA examples")
+     return examples
585
+
586
+
587
+ def main():
588
+ # Build all training examples
589
+ qe_examples = process_hotpotqa()
590
+ me_examples = create_synthetic_web_pages()
591
+ na_examples = create_na_examples()
592
+
593
+ all_examples = qe_examples + me_examples + na_examples
594
+ random.shuffle(all_examples)
595
+
596
+ print(f"\nTotal examples: {len(all_examples)}")
597
+
598
+ # Count by type
599
+ type_counts = defaultdict(int)
600
+ for ex in all_examples:
601
+ type_counts[ex['task_type']] += 1
602
+ for t, c in type_counts.items():
603
+ print(f" {t}: {c}")
604
+
605
+ # Check lengths
606
+ from transformers import AutoTokenizer
607
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
608
+
609
+ lengths = []
610
+ for ex in all_examples[:500]:
611
+ text = tokenizer.apply_chat_template(ex['messages'], tokenize=False)
612
+ tokens = tokenizer.encode(text)
613
+ lengths.append(len(tokens))
614
+
615
+ print(f"\nToken length stats (sample of 500):")
616
+ print(f" Min: {min(lengths)}")
617
+ print(f" Max: {max(lengths)}")
618
+ print(f" Mean: {sum(lengths)/len(lengths):.0f}")
619
+ print(f" Median: {sorted(lengths)[len(lengths)//2]}")
620
+
621
+ # Filter out examples that are too long (>4096 tokens for efficiency)
622
+ MAX_LEN = 4096
623
+ filtered = []
624
+ too_long = 0
625
+ for ex in all_examples:
626
+ text = tokenizer.apply_chat_template(ex['messages'], tokenize=False)
627
+ tokens = tokenizer.encode(text)
628
+ if len(tokens) <= MAX_LEN:
629
+ filtered.append(ex)
630
+ else:
631
+ too_long += 1
632
+
633
+ print(f"\nFiltered: {too_long} examples too long (>{MAX_LEN} tokens)")
634
+ print(f"Final dataset: {len(filtered)} examples")
635
+
636
+ # Split into train/eval
637
+ random.shuffle(filtered)
638
+ eval_size = min(500, len(filtered) // 10)
639
+ train_data = filtered[:-eval_size]
640
+ eval_data = filtered[-eval_size:]
641
+
642
+ print(f"Train: {len(train_data)}, Eval: {len(eval_data)}")
643
+
644
+ # Create HF dataset with just messages column (for SFTTrainer)
645
+ train_ds = Dataset.from_list([{"messages": ex["messages"]} for ex in train_data])
646
+ eval_ds = Dataset.from_list([{"messages": ex["messages"]} for ex in eval_data])
647
+
648
+ # Save locally
649
+ train_ds.save_to_disk("/app/indexlm_train")
650
+ eval_ds.save_to_disk("/app/indexlm_eval")
651
+
652
+ # Also push to HF Hub
653
+ from datasets import DatasetDict
654
+ import os
655
+ ds_dict = DatasetDict({"train": train_ds, "eval": eval_ds})
656
+ ds_dict.push_to_hub("OmAlve/indexlm-training-data", token=os.environ.get("HF_TOKEN"))
657
+
658
+ print("\nDone! Dataset pushed to OmAlve/indexlm-training-data")
659
+
660
+
661
+ if __name__ == "__main__":
662
+ main()
663
+ ```
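
Before pushing a regenerated dataset, it may be worth sanity-checking a few records produced by the script above. The validator below is a hypothetical addition, not part of `prepare_data.py`. It assumes that every block occupies exactly one line starting with `[i] ` and that targets are either `NA` or a JSON list of intervals.

```python
import json
import re


def count_blocks(user_content):
    """Count lines that start with an index marker such as "[12] "."""
    return len(re.findall(r"^\[\d+\] ", user_content, flags=re.MULTILINE))


def validate_example(messages):
    """Return a list of problems found in one system/user/assistant record."""
    if [m["role"] for m in messages] != ["system", "user", "assistant"]:
        return ["unexpected role sequence"]
    target = messages[2]["content"]
    if target == "NA":
        return []
    try:
        intervals = json.loads(target)
    except json.JSONDecodeError:
        return ["target is neither 'NA' nor valid JSON"]
    num_blocks = count_blocks(messages[1]["content"])
    problems = []
    previous_end = 0
    for start, end in intervals:
        if not (1 <= start <= end <= num_blocks):
            problems.append(f"interval [{start}, {end}] outside 1..{num_blocks}")
        if start <= previous_end:
            problems.append(f"interval [{start}, {end}] overlaps or is out of order")
        previous_end = end
    return problems


example = [
    {"role": "system", "content": "You are IndexLM..."},
    {"role": "user", "content": "URL: ...\nQuery: ...\n\nBlocks:\n[1] <nav>Home</nav>\n[2] <h2>Title</h2>\n[3] <p>Content</p>\n\nOutput the index intervals of blocks relevant to the query."},
    {"role": "assistant", "content": "[[2, 3]]"},
]
print(validate_example(example))
```

An empty list means the record passed every check. Running this over a sample of the train split would catch reindexing mistakes from the noise-injection step.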
+
+ ### 5.2 Training Script (`train_indexlm.py`)
+
+ ```python
+ """
+ IndexLM Training Script - Fine-tune Qwen3-0.6B for Index-based Web Content Extraction
+
+ Based on: "An Index-based Approach for Efficient and Effective Web Content Extraction" (arxiv:2512.06641)
+ Base model: Qwen/Qwen3-0.6B (0.6B params, ideal for CPU deployment)
+ Training method: SFT with TRL SFTTrainer
+ Dataset: OmAlve/indexlm-training-data (21,098 train / 500 eval examples)
+ """
+
+ import os
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from trl import SFTTrainer, SFTConfig
+ import trackio
+
+ # ============ Configuration ============
+ MODEL_ID = "Qwen/Qwen3-0.6B"
+ DATASET_ID = "OmAlve/indexlm-training-data"
+ OUTPUT_DIR = "./indexlm-0.6b"
+ HUB_MODEL_ID = "OmAlve/IndexLM-0.6B"
+
+ # Training hyperparameters (from paper: standard SFT)
+ LEARNING_RATE = 2e-5
+ NUM_EPOCHS = 3
+ BATCH_SIZE = 4
+ GRAD_ACCUM = 4 # Effective batch size = 16
+ MAX_SEQ_LENGTH = 4096
+ WARMUP_RATIO = 0.05
+
+ # ============ Setup Trackio ============
+ trackio.init(
+     name="indexlm-0.6b-training",
+     project="indexlm"
+ )
+
+ # ============ Load Dataset ============
+ print("Loading dataset...")
+ dataset = load_dataset(DATASET_ID)
+ train_dataset = dataset["train"]
+ eval_dataset = dataset["eval"]
+ print(f"Train: {len(train_dataset)}, Eval: {len(eval_dataset)}")
+
+ # ============ Load Model & Tokenizer ============
+ print("Loading model and tokenizer...")
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+ # Ensure padding token is set
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID,
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2", # Change to "sdpa" if flash-attn unavailable
+ )
+
+ print(f"Model loaded: {MODEL_ID}")
+ print(f"Model params: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
+
+ # ============ Training Config ============
+ training_args = SFTConfig(
+     output_dir=OUTPUT_DIR,
+     num_train_epochs=NUM_EPOCHS,
+     per_device_train_batch_size=BATCH_SIZE,
+     per_device_eval_batch_size=BATCH_SIZE,
+     gradient_accumulation_steps=GRAD_ACCUM,
+     learning_rate=LEARNING_RATE,
+     lr_scheduler_type="cosine",
+     warmup_ratio=WARMUP_RATIO,
+     weight_decay=0.01,
+     bf16=True,
+     gradient_checkpointing=True,
+     max_length=MAX_SEQ_LENGTH,
+     # Logging
+     logging_steps=10,
+     logging_first_step=True,
+     logging_strategy="steps",
+     disable_tqdm=True,
+     # Evaluation
+     eval_strategy="steps",
+     eval_steps=500,
+     # Saving
+     save_strategy="steps",
+     save_steps=500,
+     save_total_limit=3,
+     load_best_model_at_end=True,
+     metric_for_best_model="eval_loss",
+     greater_is_better=False,
+     # Hub push
+     push_to_hub=True,
+     hub_model_id=HUB_MODEL_ID,
+     hub_strategy="every_save",
+     # Performance
+     dataloader_num_workers=4,
+     dataloader_pin_memory=True,
+     # Report
+     # NOTE: "none" disables the Trainer's metric reporting, so nothing reaches Trackio
+     # despite trackio.init() above; recent transformers releases accept
+     # report_to="trackio" (check your installed version).
+     report_to="none",
+     # Seed
+     seed=42,
+ )
+
+ # ============ Initialize Trainer ============
+ print("Initializing trainer...")
+ trainer = SFTTrainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     eval_dataset=eval_dataset,
+     processing_class=tokenizer,
+ )
+
+ # ============ Train ============
+ print("Starting training...")
+ train_result = trainer.train()
+
+ # ============ Save Final Model ============
+ print("Saving final model...")
+ trainer.save_model(OUTPUT_DIR)
+ tokenizer.save_pretrained(OUTPUT_DIR)
+
+ # Push to Hub
+ print("Pushing to Hub...")
+ trainer.push_to_hub(commit_message="Final IndexLM-0.6B model")
+
+ # ============ Log Final Metrics ============
+ metrics = train_result.metrics
+ print("\nTraining complete!")
+ print(f" Train loss: {metrics.get('train_loss', 'N/A')}")
+ # Default to NaN so the numeric format specs cannot fail on a missing key
+ print(f" Train runtime: {metrics.get('train_runtime', float('nan')):.0f}s")
+ print(f" Train samples/sec: {metrics.get('train_samples_per_second', float('nan')):.1f}")
+
+ # Final eval
+ eval_metrics = trainer.evaluate()
+ print(f" Eval loss: {eval_metrics.get('eval_loss', 'N/A')}")
+
+ print(f"\nModel pushed to: https://huggingface.co/{HUB_MODEL_ID}")
+ ```
+
+ ### 5.3 Evaluation Script (`eval_indexlm.py`)
+
+ ```python
+ """
+ IndexLM Evaluation Script
+
+ Tests the trained model on:
+ 1. Query-relevant extraction (QE) - F1/Precision/Recall
+ 2. Main content extraction (ME) - F1/Precision/Recall
+ 3. Inference speed on CPU
+ """
+
+ import json
+ import time
+ import os
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+ def parse_intervals(text):
+     """Parse interval string like '[[1,3],[5,7]]' into a set of indices."""
+     text = text.strip()
+     if text.upper() == 'NA' or not text:
+         return set()
+     try:
+         intervals = json.loads(text)
+         indices = set()
+         for start, end in intervals:
+             indices.update(range(start, end + 1))
+         return indices
+     except (json.JSONDecodeError, TypeError, ValueError):
+         # Malformed output counts as an empty prediction
+         return set()
+
+
+ def compute_f1(pred_indices, gold_indices):
+     """Compute F1, precision, recall between two sets of indices."""
+     if not pred_indices and not gold_indices:
+         # Both empty (correct 'NA' prediction) counts as a perfect match
+         return 1.0, 1.0, 1.0
+     if not pred_indices or not gold_indices:
+         return 0.0, 0.0, 0.0
+
+     tp = len(pred_indices & gold_indices)
+     precision = tp / len(pred_indices) if pred_indices else 0
+     recall = tp / len(gold_indices) if gold_indices else 0
+     f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
+     return f1, precision, recall
+
+
+ def generate_response(model, tokenizer, messages, device, max_new_tokens=128):
+     """Generate model response for given messages."""
+     text = tokenizer.apply_chat_template(
+         messages[:-1],  # Exclude assistant message (ground truth)
+         tokenize=False,
+         add_generation_prompt=True,
+         enable_thinking=False,
+     )
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to(device)
+
+     with torch.no_grad():
+         outputs = model.generate(
+             **inputs,
+             max_new_tokens=max_new_tokens,
+             do_sample=False,  # Greedy for deterministic eval
871
+ temperature=1.0,
872
+ pad_token_id=tokenizer.pad_token_id,
873
+ )
874
+
875
+ # Decode only the new tokens
876
+ new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
877
+ response = tokenizer.decode(new_tokens, skip_special_tokens=True)
878
+ return response.strip()
879
+
880
+
881
+ def evaluate_model(model_id, device="cpu", num_samples=100):
882
+ """Run full evaluation."""
883
+ print(f"\n{'='*60}")
884
+ print(f"Evaluating: {model_id}")
885
+ print(f"Device: {device}")
886
+ print(f"{'='*60}")
887
+
888
+ # Load model
889
+ print("Loading model...")
890
+ dtype = torch.float32 if device == "cpu" else torch.bfloat16
891
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
892
+ model = AutoModelForCausalLM.from_pretrained(
893
+ model_id,
894
+ torch_dtype=dtype,
895
+ attn_implementation="sdpa",
896
+ ).to(device)
897
+ model.eval()
898
+
899
+ # Load eval dataset
900
+ print("Loading eval dataset...")
901
+ dataset = load_dataset("OmAlve/indexlm-training-data", split="eval")
902
+
903
+ # Sample
904
+ if len(dataset) > num_samples:
905
+ dataset = dataset.shuffle(seed=42).select(range(num_samples))
906
+
907
+ # Categorize examples
908
+ qe_examples = []
909
+ me_examples = []
910
+
911
+ for row in dataset:
912
+ msgs = row['messages']
913
+ system_msg = msgs[0]['content'] if msgs[0]['role'] == 'system' else ''
914
+ if 'query' in system_msg.lower() and 'relevant' in system_msg.lower():
915
+ qe_examples.append(msgs)
916
+ else:
917
+ me_examples.append(msgs)
918
+
919
+ print(f"QE examples: {len(qe_examples)}, ME examples: {len(me_examples)}")
920
+
921
+ # Evaluate QE
922
+ print("\n--- Query-Relevant Extraction (QE) ---")
923
+ qe_metrics = evaluate_task(model, tokenizer, qe_examples[:50], device)
924
+
925
+ # Evaluate ME
926
+ print("\n--- Main Content Extraction (ME) ---")
927
+ me_metrics = evaluate_task(model, tokenizer, me_examples[:50], device)
928
+
929
+ # Speed test
930
+ print("\n--- Inference Speed Test ---")
931
+ speed_test(model, tokenizer, qe_examples[:20], device)
932
+
933
+ return qe_metrics, me_metrics
934
+
935
+
936
+ def evaluate_task(model, tokenizer, examples, device):
937
+ """Evaluate on a set of examples."""
938
+ if not examples:
939
+ print("No examples for this task.")
940
+ return {}
941
+
942
+ f1_scores = []
943
+ precision_scores = []
944
+ recall_scores = []
945
+ exact_matches = 0
946
+
947
+ for i, msgs in enumerate(examples):
948
+ gold = msgs[-1]['content']
949
+ gold_indices = parse_intervals(gold)
950
+
951
+ pred = generate_response(model, tokenizer, msgs, device)
952
+ pred_indices = parse_intervals(pred)
953
+
954
+ f1, prec, rec = compute_f1(pred_indices, gold_indices)
955
+ f1_scores.append(f1)
956
+ precision_scores.append(prec)
957
+ recall_scores.append(rec)
958
+
959
+ if pred_indices == gold_indices:
960
+ exact_matches += 1
961
+
962
+ if i < 3:
963
+ print(f" Example {i+1}:")
964
+ print(f" Gold: {gold}")
965
+ print(f" Pred: {pred}")
966
+ print(f" F1: {f1:.3f}, P: {prec:.3f}, R: {rec:.3f}")
967
+
968
+ avg_f1 = sum(f1_scores) / len(f1_scores) * 100
969
+ avg_prec = sum(precision_scores) / len(precision_scores) * 100
970
+ avg_rec = sum(recall_scores) / len(recall_scores) * 100
971
+ em_rate = exact_matches / len(examples) * 100
972
+
973
+ print(f"\n Results ({len(examples)} examples):")
974
+ print(f" F1: {avg_f1:.2f}")
975
+ print(f" Precision: {avg_prec:.2f}")
976
+ print(f" Recall: {avg_rec:.2f}")
977
+ print(f" Exact Match: {em_rate:.2f}%")
978
+
979
+ return {"f1": avg_f1, "precision": avg_prec, "recall": avg_rec, "exact_match": em_rate}
980
+
981
+
982
+ def speed_test(model, tokenizer, examples, device):
983
+ """Test inference speed."""
984
+ if not examples:
985
+ return
986
+
987
+ times = []
988
+ for msgs in examples:
989
+ start = time.perf_counter() # monotonic clock, better suited to timing than time.time()
990
+ _ = generate_response(model, tokenizer, msgs, device)
991
+ elapsed = time.perf_counter() - start
992
+ times.append(elapsed)
993
+
994
+ avg_time = sum(times) / len(times)
995
+ print(f" Average inference time: {avg_time:.3f}s ({device})")
996
+ print(f" Min: {min(times):.3f}s, Max: {max(times):.3f}s")
997
+ print(f" Throughput: {1/avg_time:.1f} pages/sec")
998
+
999
+
1000
+ if __name__ == "__main__":
1001
+ model_id = os.environ.get("MODEL_ID", "OmAlve/IndexLM-0.6B")
1002
+ device = "cuda" if torch.cuda.is_available() else "cpu"
1003
+ evaluate_model(model_id, device=device, num_samples=100)
1004
+ ```
1005
+
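As a worked example of the scoring used above, the sketch below re-implements the interval expansion and the set-based F1 in compact form and scores one made-up prediction. The helper names `expand` and `score` and the sample strings are hypothetical and exist only for illustration; they are not part of `eval_indexlm.py`.

```python
import json


def expand(intervals_text):
    """Expand '[[1,3],[5,7]]' into {1, 2, 3, 5, 6, 7}; 'NA' or empty text gives an empty set."""
    text = intervals_text.strip()
    if not text or text.upper() == "NA":
        return set()
    indices = set()
    for start, end in json.loads(text):
        indices.update(range(start, end + 1))  # intervals are inclusive on both ends
    return indices


def score(pred, gold):
    """Return (f1, precision, recall) over two sets of block indices."""
    if not pred and not gold:
        return 1.0, 1.0, 1.0  # both empty: the model correctly selected nothing
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return f1, precision, recall


gold = expand("[[1,3],[5,7]]")  # hypothetical gold label
pred = expand("[[2,5]]")        # hypothetical model output
f1, precision, recall = score(pred, gold)
print(f"F1={f1:.2f} P={precision:.2f} R={recall:.2f}")
```

Here block 4 is a false positive and blocks 1, 6, and 7 are missed, so precision is 3/4 and recall is 3/6.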
1006
+ ---
1007
+
1008
+ ## 6. Key Design Decisions & Rationale
1009
+
1010
+ ### Why Qwen3-0.6B?
1011
+ - The paper uses Qwen3-0.6B/1.7B/4B. The 0.6B model achieves **near-identical performance** to the 4B model on RAG QA (54.70 vs 55.41 avg F1).
1012
+ - 0.6B is **1.4GB in bf16, ~700MB in INT4** — runs fast on CPU
1013
+ - TRL's own SFT documentation uses Qwen3-0.6B as its default example model — maximum compatibility
1014
+ - Qwen3 uses GQA (grouped-query attention), which is faster at inference than MHA (multi-head attention)
1015
+
1016
+ ### Why not ReaderLM-v2?
1017
+ - ReaderLM-v2 does generative HTML→Markdown extraction (different task)
1018
+ - It's **33-70× slower** than IndexLM on the paper's benchmarks
1019
+ - Fine-tuning it for index prediction would fight against its pretrained generation behavior
1020
+
1021
+ ### Dataset construction vs. the paper
1022
+ The paper uses:
1023
+ 1. Google Search API crawls → real HTML from the web
1024
+ 2. DeepSeek V3 annotation with 5-run majority voting
1025
+ 3. Common Crawl WARC files
1026
+
1027
+ We approximate this with:
1028
+ 1. HotpotQA's structured context (title + sentences) converted to indexed HTML blocks
1029
+ 2. Programmatic labeling from HotpotQA's `supporting_facts` ground truth (human-annotated, so it avoids the noise of LLM annotation)
1030
+ 3. Synthetic noise injection (nav, ads, cookies, etc.) to simulate real web clutter
1031
+ 4. Mismatched query-page pairs for NA examples
1032
+
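The labeling step in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual `prepare_data.py`: it assumes the Hugging Face HotpotQA layout (a `context` with parallel `title` and `sentences` lists, and `supporting_facts` with parallel `title` and `sent_id` lists), 1-based block indices, and noise blocks placed before the content. The function names `indices_to_intervals` and `build_example` are hypothetical.

```python
import json


def indices_to_intervals(indices):
    """Collapse block indices into sorted, inclusive [start, end] runs."""
    intervals = []
    for i in sorted(set(indices)):
        if intervals and i == intervals[-1][1] + 1:
            intervals[-1][1] = i  # extend the current run
        else:
            intervals.append([i, i])  # start a new run
    return intervals


def build_example(context, supporting_facts, noise_blocks=()):
    """Return (blocks, label) where label is an interval string or 'NA'."""
    support = set(zip(supporting_facts["title"], supporting_facts["sent_id"]))
    blocks = list(noise_blocks)  # assumption: clutter is prepended to the page
    gold = []
    for title, sentences in zip(context["title"], context["sentences"]):
        for sent_id, sentence in enumerate(sentences):
            blocks.append(sentence.strip())
            if (title, sent_id) in support:
                gold.append(len(blocks))  # 1-based index of the block just added
    if not gold:
        return blocks, "NA"
    return blocks, json.dumps(indices_to_intervals(gold), separators=(",", ":"))


# Toy record in the assumed layout (not real HotpotQA data).
context = {"title": ["A", "B"], "sentences": [["a0", "a1", "a2"], ["b0", "b1"]]}
supporting_facts = {"title": ["A", "A", "B"], "sent_id": [0, 1, 1]}
blocks, label = build_example(context, supporting_facts, noise_blocks=["Accept cookies"])
print(label)
```

Pairing a query with a page whose `supporting_facts` do not match any block yields the `NA` label, which is how the mismatched query-page examples would fall out of the same function.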
1033
+ **Trade-off**: Our HTML blocks are simpler than real web HTML (no nested tables, complex CSS-in-JS, etc.). For production use, augmenting with real crawled HTML would improve robustness. Reproducing the paper's full pipeline would incur API costs (Google Search, DeepSeek V3).
1034
+
1035
+ ### Hyperparameters
1036
+ The paper (Section 3.3.2) says only that "The training process is a typical SFT process" on Qwen3, so we use standard SFT settings:
1037
+ - lr=2e-5 (TRL SFT default, standard for Qwen3)
1038
+ - 3 epochs (standard SFT)
1039
+ - Effective batch size 16 (batch size 4 × 4 gradient-accumulation steps)
1040
+ - Cosine LR schedule with 5% warmup
1041
+ - max_length=4096 (covers 99.8% of our data, well within Qwen3's 32K context)
1042
+
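To make the schedule concrete, the arithmetic behind these settings can be sketched as below. The batch size, accumulation steps, epochs, and warmup ratio come from the list above, and `save_steps` from the training script. The single GPU and the 10,000-example training split are assumptions chosen only for illustration, and the Trainer's own step count may differ slightly because of rounding.

```python
import math

# Values from the hyperparameter list and training script.
per_device_batch = 4
grad_accum_steps = 4
epochs = 3
warmup_ratio = 0.05
save_steps = 500

# Assumptions for illustration only: one GPU and a 10,000-example training split.
num_gpus = 1
num_train_examples = 10_000

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
checkpoints = total_steps // save_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
print(f"step-based checkpoints: {checkpoints}")
```

Under these assumptions, the run takes 1,875 optimizer steps, of which 94 are warmup, and it writes three step-based checkpoints.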
1043
+ ---
1044
+
1045
+ ## 7. What's Left To Do
1046
+
1047
+ | Task | Status | Notes |
1048
+ |------|--------|-------|
1049
+ | Run `train_indexlm.py` | ❌ | Needs GPU — a10g-large recommended (~$8 total) |
1050
+ | Run `eval_indexlm.py` | ❌ | After training completes |
1051
+ | ONNX export for CPU | ❌ | Optional: `optimum-cli export onnx --model OmAlve/IndexLM-0.6B indexlm-onnx/` |
1052
+ | INT4 quantization | ❌ | Optional: use `bitsandbytes` or `llama.cpp` for faster CPU inference |
1053
+ | Real HTML augmentation | ❌ | Optional: crawl real web pages to augment training data |
1054
+
1055
+ ---
1056
+
1057
+ ## 8. Resources
1058
+
1059
+ | Resource | URL |
1060
+ |----------|-----|
1061
+ | Paper | https://arxiv.org/abs/2512.06641 |
1062
+ | Training dataset | https://huggingface.co/datasets/OmAlve/indexlm-training-data |
1063
+ | Base model | https://huggingface.co/Qwen/Qwen3-0.6B |
1064
+ | Output model (after training) | https://huggingface.co/OmAlve/IndexLM-0.6B |
1065
+ | TRL SFT docs | https://huggingface.co/docs/trl/sft_trainer |
1066
+ | HotpotQA source | https://huggingface.co/datasets/hotpotqa/hotpot_qa |
1067
+
1068
+ ---
1069
+
1070
+ ## 9. Dependencies
1071
+
1072
+ ```
1073
+ transformers>=4.51.0
1074
+ trl>=1.2.0
1075
+ torch
1076
+ datasets
1077
+ accelerate
1078
+ trackio
1079
+ flash-attn # optional, GPU training only
1080
+ beautifulsoup4 # only for prepare_data.py
1081
+ lxml # only for prepare_data.py
1082
+ ```