md-reheader

Restore heading hierarchy in markdown documents using a fine-tuned Qwen3-0.6B model (0.6B parameters).

PDF-to-markdown tools like MinerU, Docling, and Marker often flatten heading structure, so every heading comes out as # or ##. This model predicts the correct heading levels (H1–H6) from document context and semantics.

Before (flat output from a PDF parser)

# API Reference

# Authentication

All API requests require a Bearer token in the Authorization header.

# Endpoints

# Users

# List Users

Returns a paginated list of all users in the organization.

# Create User

Creates a new user account. Requires admin privileges.

# Projects

# List Projects

Returns all projects the authenticated user has access to.

# Error Handling

All errors follow the standard format with a status code and message field.

After (restored by md-reheader)

# API Reference

## Authentication

All API requests require a Bearer token in the Authorization header.

## Endpoints

### Users

#### List Users

Returns a paginated list of all users in the organization.

#### Create User

Creates a new user account. Requires admin privileges.

### Projects

#### List Projects

Returns all projects the authenticated user has access to.

## Error Handling

All errors follow the standard format with a status code and message field.

Usage

With the md-reheader package (recommended)

pip install md-reheader

from md_reheader.inference.predict import load_model, reheader_document

model, tokenizer = load_model("joelbarmettler/md-reheader")

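# Read the flattened document; the context manager closes the file afterwards.
with open("document.md", encoding="utf-8") as f:
    flat_markdown = f.read()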
fixed = reheader_document(flat_markdown, model, tokenizer)

The package handles preprocessing (heading flattening, body stripping) and postprocessing (applying predicted levels back to the original document) automatically.
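The postprocessing step can be pictured with a small sketch. This is not the package's actual code: `apply_predicted_levels` is a hypothetical helper, and it assumes the model returns one heading per line in document order, so headings are matched by position rather than by text.

```python
import re

# An ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def apply_predicted_levels(original: str, predicted: str) -> str:
    """Rewrite heading prefixes in `original` using the levels in `predicted`.

    Hypothetical helper (an assumption, not the md-reheader API). The i-th
    predicted heading sets the level of the i-th heading in the original.
    Lines inside fenced code blocks are never treated as headings. If there
    are fewer predictions than headings, the remaining headings are kept.
    """
    levels = []
    for line in predicted.splitlines():
        m = HEADING_RE.match(line.strip())
        if m:
            levels.append(len(m.group(1)))

    out, in_fence, i = [], False, 0
    for line in original.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        m = None if in_fence else HEADING_RE.match(line)
        if m and i < len(levels):
            line = "#" * levels[i] + " " + m.group(2)
            i += 1
        out.append(line)
    return "\n".join(out)
```

Calling `apply_predicted_levels("# A\n\ntext\n\n# B", "# A\n## B")` keeps the body text in place and changes only the prefix of the second heading. Matching by position is the simplest choice; a more defensive version could also compare heading text to detect dropped or reordered predictions.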

Direct usage with transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("joelbarmettler/md-reheader")
model = AutoModelForCausalLM.from_pretrained(
    "joelbarmettler/md-reheader",
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a markdown document structure expert. Given a markdown document with incorrect or flattened heading levels, output each heading with its correct markdown prefix (# for level 1, ## for level 2, etc.), one per line."},
    {"role": "user", "content": "# Introduction\n\nSome text...\n\n# Background\n\nMore text...\n\n# Methods"},
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
# # Introduction
# ## Background
# ## Methods

Important: You must pass enable_thinking=False to apply_chat_template to match the training format. Without this, the model will not generate correctly.

Evaluation Results

Evaluated on 7,321 test documents from GitHub markdown files and Wikipedia articles:

| Metric | All-H1 baseline | Heuristic | md-reheader |
|---|---|---|---|
| Exact match | 0.0% | 14.5% | 56.1% |
| Per-heading accuracy | 13.1% | 49.1% | 80.6% |
| Hierarchy preservation | 61.3% | 68.6% | 91.0% |
| Mean absolute error | 1.38 | 0.62 | 0.22 |
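The exact metric definitions are not spelled out here, so the following is only one plausible reading: exact match counts documents whose heading levels are all correct, while per-heading accuracy and mean absolute error are pooled over all headings. The function name `heading_metrics` and the pooling choice are assumptions. Hierarchy preservation is left out because its definition is not given.

```python
def heading_metrics(gold: list[list[int]], pred: list[list[int]]) -> dict[str, float]:
    """Compute exact match, per-heading accuracy, and mean absolute error.

    `gold` and `pred` hold one list of heading levels (1-6) per document.
    Assumption: each predicted list has the same length as its gold list,
    and accuracy/MAE are pooled over all headings (not averaged per document).
    """
    exact = correct = total = abs_err = 0
    for g, p in zip(gold, pred):
        if len(g) != len(p):
            raise ValueError("gold and predicted heading counts differ")
        exact += int(g == p)                               # whole document correct
        correct += sum(a == b for a, b in zip(g, p))       # headings with the right level
        abs_err += sum(abs(a - b) for a, b in zip(g, p))   # distance in levels
        total += len(g)
    return {
        "exact_match": exact / len(gold),
        "per_heading_accuracy": correct / total,
        "mean_absolute_error": abs_err / total,
    }
```

For example, scoring an all-H1 prediction `[[1, 1, 1, 1], [1, 1]]` against gold levels `[[1, 2, 2, 3], [1, 2]]` gives an exact match of 0, a per-heading accuracy of 2/6, and a mean absolute error of 5/6.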

Per-level accuracy

| Heading level | H1 | H2 | H3 | H4 | H5 | H6 |
|---|---|---|---|---|---|---|
| Accuracy | 77% | 85% | 78% | 68% | 45% | 50% |

By document depth

| Max depth | Exact match | Accuracy | Hierarchy |
|---|---|---|---|
| Depth 2 | 83% | 91% | 95% |
| Depth 3 | 54% | 82% | 90% |
| Depth 4 | 32% | 70% | 88% |
| Depth 5-6 | 33% | 65% | 89% |

Training

  • Base model: Qwen/Qwen3-0.6B (text-only, 0.6B parameters)
  • Framework: Axolotl
  • Data: ~197k markdown documents from GitHub and Wikipedia (with deep-doc oversampling)
  • Hardware: 2x NVIDIA RTX 4090, DDP, BF16
  • Sequence length: 8,192 tokens with sample packing
  • Learning rate: 5e-5 with cosine schedule
  • Epochs: 2 (epoch 1 checkpoint used; epoch 2 overfits)
  • Effective batch size: 24

Input format

During training, documents are preprocessed:

  1. All headings flattened to # (level 1)
  2. Body text between headings truncated to first 128 + last 128 tokens
  3. Total document truncated to fit within 8k tokens

The model then outputs each heading with the correct # prefix and heading text.
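The first two preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the training pipeline itself: `flatten_and_strip` is a hypothetical name, tokens are approximated by whitespace-separated words instead of tokenizer tokens, fenced code blocks are not special-cased, and the final truncation to the overall token budget (step 3) is omitted.

```python
import re

# An ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def flatten_and_strip(markdown: str, keep: int = 128) -> str:
    """Flatten every heading to level 1 and shorten the body under each one.

    Hypothetical helper. Assumption: "tokens" are whitespace-separated
    words here; the real pipeline presumably counts tokenizer tokens.
    """
    # Split the document into (heading, body lines) blocks. The first block
    # has no heading and holds any text before the first heading.
    blocks = [(None, [])]
    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            blocks.append(("# " + m.group(1), []))  # step 1: flatten to '#'
        else:
            blocks[-1][1].append(line)

    parts = []
    for heading, body_lines in blocks:
        if heading is not None:
            parts.append(heading)
        words = " ".join(body_lines).split()
        if len(words) > 2 * keep:
            # step 2: keep only the first and last `keep` words of the body
            words = words[:keep] + words[-keep:]
        if words:
            parts.append(" ".join(words))
    return "\n\n".join(parts)
```

With `keep=2`, the input `"## Intro\n\none two three four five six\n\n### Deep\n\nshort"` becomes `"# Intro\n\none two five six\n\n# Deep\n\nshort"`: both headings are flattened to level 1 and the long body keeps only its first and last two words.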

Speed

| Document size | GPU (RTX 4090) | CPU |
|---|---|---|
| < 1k tokens | 0.4s | 5s |
| 1k-2k tokens | 0.8s | 10s |
| 2k-4k tokens | 1.4s | ~20s |
| 4k-8k tokens | 3.4s | ~60s |

Limitations

  • Deep nesting (H5/H6): Accuracy drops to 45-50%. The model preserves relative structure but compresses deep hierarchies.
  • Ambiguous structure: Heading levels are inherently subjective. The model learns conventions but cannot resolve genuine ambiguity.
  • Long documents: Documents exceeding ~8k tokens after preprocessing are truncated.
  • English-centric: Trained primarily on English documents.

Author

Built by Joel Barmettler.

Citation

@software{barmettler2026mdreheader,
  author = {Barmettler, Joel},
  title = {md-reheader: Restoring Heading Hierarchy in Markdown Documents},
  year = {2026},
  url = {https://github.com/joelbarmettlerUZH/md-reheader}
}