# distill-structure
A fine-tuned Qwen3.5-2B model for HTML structure analysis: given a compact DOM representation of a web page, it identifies the logical sections and outputs structured JSON.
## What it does
Takes a cleaned, heading-stripped HTML page and returns a JSON array describing its sections:
```json
[
  {
    "title": "Main News Feed Content",
    "start_text": "1. Canada's bill C-22 mandates...",
    "content_type": "article",
    "assets": [{"type": "link", "value": "Canada's bill C-22..."}]
  },
  {
    "title": "Site Footer Navigation",
    "start_text": "Guidelines | FAQ | Lists",
    "content_type": "footer",
    "assets": []
  }
]
```
## Use case
This model powers the StructureAgent inside the distill pipeline: it handles pages with no heading tags, where rule-based sectioning fails. The model is trained to recover the section structure that headings would normally provide.
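A rough sketch of how that fallback dispatch might look. The `needs_model` helper is hypothetical, not part of the distill pipeline; it only illustrates the "no heading tags" condition described above:

```python
import re

def needs_model(html: str) -> bool:
    """Hypothetical dispatch check: fall back to the fine-tuned model only
    when the page has no <h1>-<h6> tags for rule-based sectioning to anchor on."""
    return re.search(r"<h[1-6][\s>]", html, flags=re.I) is None

needs_model("<p>only paragraphs, no headings</p>")   # model path
needs_model("<h1>Title</h1><p>body</p>")             # rule-based path
```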
## Training
- Base model: Qwen/Qwen3.5-2B
- Method: LoRA fine-tuning (r=32, α=64) via TRL `SFTTrainer`
- Dataset: ~3,455 training / 384 eval examples generated from heading-rich web pages (headings stripped and used as labels)
- Epochs: 3; train loss: 1.009; token accuracy: 80.5%
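The label-generation idea behind the dataset (strip headings, keep them as supervision targets) can be sketched roughly as follows. This is a simplified regex-based illustration under stated assumptions, not the actual data pipeline:

```python
import re

def strip_headings(html: str) -> tuple[str, list[str]]:
    """Remove <h1>-<h6> elements from HTML, returning the stripped page
    and the heading texts, which serve as section-title labels."""
    labels: list[str] = []

    def _take(match: re.Match) -> str:
        # Keep the heading's inner text (minus any nested tags) as a label,
        # and delete the whole heading element from the page.
        labels.append(re.sub(r"<[^>]+>", "", match.group(1)).strip())
        return ""

    stripped = re.sub(r"<h[1-6][^>]*>(.*?)</h[1-6]>", _take, html,
                      flags=re.S | re.I)
    return stripped, labels

strip_headings("<h2>News</h2><p>Canada's bill C-22 mandates...</p>")
```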
## Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model_id = "nahidstaq/distill-structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

SYSTEM = (
    "You are an HTML structure analyzer. Given a compact DOM representation "
    "of a web page (with headings removed), identify the logical sections. "
    "Output a JSON array of sections, each with title, start_text, content_type, and assets fields."
)

def analyze(page_title: str, compact_dom: str) -> list[dict]:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Page: {page_title}\n\n{compact_dom}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=512, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    raw = tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return json.loads(raw)
```
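The `json.loads(raw)` call above assumes the model emits a bare JSON array. If a generation picks up stray text around the array, a defensive parser helps. `extract_json_array` is a hypothetical helper sketched here, not part of the model's API:

```python
import json

def extract_json_array(raw: str) -> list:
    """Find and parse the first JSON array embedded anywhere in a string."""
    decoder = json.JSONDecoder()
    start = raw.find("[")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(raw, start)
            if isinstance(value, list):
                return value
        except json.JSONDecodeError:
            pass  # this '[' didn't start valid JSON; try the next one
        start = raw.find("[", start + 1)
    raise ValueError("no JSON array found in model output")

extract_json_array('Here are the sections: [{"title": "Site Footer Navigation"}]')
```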
## Output fields
| Field | Description |
|---|---|
| `title` | Short descriptive section title |
| `start_text` | First ~50 chars of the section's text (for anchoring) |
| `content_type` | One of: `article`, `list`, `hero`, `navigation`, `footer`, `table`, `faq`, `other` |
| `assets` | Extracted links, images, or list items relevant to the section |
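To sanity-check a parsed response against this schema, a minimal validation sketch can be used. The helper name and error-message format are assumptions for illustration, not part of the distill pipeline:

```python
ALLOWED_TYPES = {"article", "list", "hero", "navigation", "footer", "table", "faq", "other"}
REQUIRED_KEYS = {"title", "start_text", "content_type", "assets"}

def validate_sections(sections: list[dict]) -> list[str]:
    """Return a list of schema problems; an empty list means the output is well-formed."""
    errors = []
    for i, sec in enumerate(sections):
        missing = REQUIRED_KEYS - sec.keys()
        if missing:
            errors.append(f"section {i}: missing {sorted(missing)}")
        if sec.get("content_type") not in ALLOWED_TYPES:
            errors.append(f"section {i}: bad content_type {sec.get('content_type')!r}")
    return errors

validate_sections([{"title": "Site Footer Navigation",
                    "start_text": "Guidelines | FAQ | Lists",
                    "content_type": "footer",
                    "assets": []}])  # well-formed: no errors
```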
## Limitations
- Works best on English pages
- Table-heavy layouts (e.g. nested `<td>` elements) may collapse into fewer sections
- `content_type` classification skews toward `other` for ambiguous sections