distill-structure

A fine-tuned Qwen3.5-2B model for HTML structure analysis: given a compact DOM representation of a web page, it identifies the logical sections and outputs structured JSON.

What it does

Takes a cleaned, heading-stripped HTML page and returns a JSON array describing its sections:

[
  {
    "title": "Main News Feed Content",
    "start_text": "1. Canada's bill C-22 mandates...",
    "content_type": "article",
    "assets": [{"type": "link", "value": "Canada's bill C-22..."}]
  },
  {
    "title": "Site Footer Navigation",
    "start_text": "Guidelines | FAQ | Lists",
    "content_type": "footer",
    "assets": []
  }
]

Use case

This model powers the StructureAgent inside the distill pipeline. It handles pages with no heading tags, where rule-based sectioning fails, and is trained to recover the section structure that headings would normally provide.
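The fallback decision described above can be sketched as a simple heading check. This helper is illustrative only and is not part of the actual distill pipeline API:

```python
# Hypothetical sketch of the fallback: use rule-based sectioning when the
# page has heading tags, and invoke this model only when it does not.
import re

# Matches an opening <h1>..<h6> tag, case-insensitively.
HEADING_RE = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)

def needs_structure_model(html: str) -> bool:
    """Return True when the page has no heading tags to section on."""
    return HEADING_RE.search(html) is None
```

A page with even one heading would stay on the rule-based path; only heading-free pages reach the model.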

Training

  • Base model: Qwen/Qwen3.5-2B
  • Method: LoRA fine-tuning (r=32, α=64) via TRL SFTTrainer
  • Dataset: ~3,455 training / 384 eval examples generated from heading-rich web pages (headings stripped and used as labels)
  • Epochs: 3; train loss: 1.009; token accuracy: 80.5%
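With those hyperparameters, the adapter setup can be sketched with PEFT roughly as follows. Note that `target_modules` and `lora_dropout` are assumptions (common choices for Qwen-style attention blocks); the card only specifies r and α:

```python
# Sketch of a LoRA adapter config matching the reported r=32, alpha=64.
# target_modules and lora_dropout are assumed, typical values and are
# NOT confirmed by the model card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                # LoRA rank from the card
    lora_alpha=64,       # scaling alpha from the card
    lora_dropout=0.05,   # assumed; not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```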

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model_id = "nahidstaq/distill-structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

SYSTEM = (
    "You are an HTML structure analyzer. Given a compact DOM representation "
    "of a web page (with headings removed), identify the logical sections. "
    "Output a JSON array of sections, each with title, start_text, content_type, and assets fields."
)

def analyze(page_title: str, compact_dom: str) -> list[dict]:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Page: {page_title}\n\n{compact_dom}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=512, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    raw = tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return json.loads(raw)
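The `json.loads(raw)` call above assumes the model emits bare JSON. A slightly sturdier parse that tolerates stray text or Markdown fences around the array can be sketched like this (this helper is an assumption, not part of the model's documented behavior):

```python
# Hedged helper: extract the first JSON array from raw model output,
# tolerating surrounding noise such as ```json fences or leading prose.
import json

def parse_sections(raw: str) -> list[dict]:
    """Extract and decode the outermost JSON array in the model's output."""
    start = raw.find("[")
    end = raw.rfind("]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON array in model output")
    return json.loads(raw[start : end + 1])
```

Swapping `return json.loads(raw)` for `return parse_sections(raw)` keeps the quick-start function working even when decoding drifts.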

Output fields

Field         Description
title         Short descriptive section title
start_text    First ~50 chars of the section's text (for anchoring)
content_type  One of: article, list, hero, navigation, footer, table, faq, other
assets        Extracted links, images, or list items relevant to the section
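Downstream code can sanity-check model output against this schema. A minimal validator, mirroring the field names and the content_type enumeration from the table above:

```python
# Minimal schema check for one section object; the field names and the
# content_type values come directly from the card's output-fields table.
CONTENT_TYPES = {"article", "list", "hero", "navigation",
                 "footer", "table", "faq", "other"}
REQUIRED_FIELDS = {"title", "start_text", "content_type", "assets"}

def validate_section(section: dict) -> bool:
    """Return True when a section has all fields and a known content_type."""
    return (
        REQUIRED_FIELDS <= section.keys()
        and section["content_type"] in CONTENT_TYPES
        and isinstance(section["assets"], list)
    )
```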

Limitations

  • Works best on English pages
  • Table-heavy layouts (e.g. nested <td>) may collapse into fewer sections
  • content_type classification skews toward other for ambiguous sections