distill-structure

A fine-tuned Qwen3.5-2B model for HTML structure analysis: given a compact DOM representation of a web page, it identifies the logical sections and outputs structured JSON.

What it does

Takes a cleaned, heading-stripped HTML page and returns a JSON array describing its sections:

[
  {
    "title": "Main News Feed Content",
    "start_text": "1. Canada's bill C-22 mandates...",
    "content_type": "article",
    "assets": [{"type": "link", "value": "Canada's bill C-22..."}]
  },
  {
    "title": "Site Footer Navigation",
    "start_text": "Guidelines | FAQ | Lists",
    "content_type": "footer",
    "assets": []
  }
]

Use case

This model powers the StructureAgent inside the distill pipeline. It handles pages with no heading tags, where rule-based sectioning fails, and is trained to recover the section structure that headings would normally provide.
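The fallback decision described above can be sketched as a simple heading check. This helper is illustrative only and is not part of the actual distill pipeline API:

```python
# Hypothetical sketch of the fallback: use rule-based sectioning when the
# page has heading tags, and invoke this model only when it does not.
import re

# Matches an opening <h1>..<h6> tag, case-insensitively.
HEADING_RE = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)

def needs_structure_model(html: str) -> bool:
    """Return True when the page has no heading tags to section on."""
    return HEADING_RE.search(html) is None
```

A page with even one heading would stay on the rule-based path; only heading-free pages reach the model.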

Training

  • Base model: Qwen/Qwen3.5-2B
  • Method: LoRA fine-tuning (r=32, α=64) via TRL SFTTrainer
  • Dataset: ~3,455 training / 384 eval examples generated from heading-rich web pages (headings stripped and used as labels)
  • Epochs: 3; train loss: 1.009; token accuracy: 80.5%
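With those hyperparameters, the adapter setup can be sketched with PEFT roughly as follows. Note that `target_modules` and `lora_dropout` are assumptions (common choices for Qwen-style attention blocks); the card only specifies r and α:

```python
# Sketch of a LoRA adapter config matching the reported r=32, alpha=64.
# target_modules and lora_dropout are assumed, typical values and are
# NOT confirmed by the model card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                # LoRA rank from the card
    lora_alpha=64,       # scaling alpha from the card
    lora_dropout=0.05,   # assumed; not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```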

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model_id = "nahidstaq/distill-structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

SYSTEM = (
    "You are an HTML structure analyzer. Given a compact DOM representation "
    "of a web page (with headings removed), identify the logical sections. "
    "Output a JSON array of sections, each with title, start_text, content_type, and assets fields."
)

def analyze(page_title: str, compact_dom: str) -> list[dict]:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Page: {page_title}\n\n{compact_dom}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=512, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    raw = tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return json.loads(raw)
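The `json.loads(raw)` call above assumes the model emits bare JSON. A slightly sturdier parse that tolerates stray text or Markdown fences around the array can be sketched like this (this helper is an assumption, not part of the model's documented behavior):

```python
# Hedged helper: extract the first JSON array from raw model output,
# tolerating surrounding noise such as ```json fences or leading prose.
import json

def parse_sections(raw: str) -> list[dict]:
    """Extract and decode the outermost JSON array in the model's output."""
    start = raw.find("[")
    end = raw.rfind("]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON array in model output")
    return json.loads(raw[start : end + 1])
```

Swapping `return json.loads(raw)` for `return parse_sections(raw)` keeps the quick-start function working even when decoding drifts.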

Output fields

Field         Description
title         Short descriptive section title
start_text    First ~50 chars of the section's text (for anchoring)
content_type  One of: article, list, hero, navigation, footer, table, faq, other
assets        Extracted links, images, or list items relevant to the section
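Downstream code can sanity-check model output against this schema. A minimal validator, mirroring the field names and the content_type enumeration from the table above:

```python
# Minimal schema check for one section object; the field names and the
# content_type values come directly from the card's output-fields table.
CONTENT_TYPES = {"article", "list", "hero", "navigation",
                 "footer", "table", "faq", "other"}
REQUIRED_FIELDS = {"title", "start_text", "content_type", "assets"}

def validate_section(section: dict) -> bool:
    """Return True when a section has all fields and a known content_type."""
    return (
        REQUIRED_FIELDS <= section.keys()
        and section["content_type"] in CONTENT_TYPES
        and isinstance(section["assets"], list)
    )
```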

Limitations

  • Works best on English pages
  • Table-heavy layouts (e.g. nested <td>) may collapse into fewer sections
  • content_type classification skews toward other for ambiguous sections