Pulpie: Pareto-Optimal Models for Cleaning the Web

Community Article
Published July 1, 2026

We're introducing Pulpie, a family of Pareto-optimal models for extracting main content from HTML pages. Pulpie approaches SOTA extraction quality at one twentieth the cost.

Our smallest model, pulpie-orange-small, scores 0.862 ROUGE-5 F1 on WebMainBench. That is within 0.2 F1 points of Dripper, the leading extractor, which scores 0.864. Pulpie gets there at roughly a third of the size: 210M parameters versus Dripper's 600M.

The gains come from architecture. Pulpie is an encoder that labels every HTML block as content or boilerplate in a single forward pass. This also makes it fast.

On an NVIDIA L4 GPU, pulpie-orange-small processes 13.7 pages/sec against Dripper's 0.68 pages/sec. At $0.39/hr for an L4 instance, cleaning 1 billion pages costs $7,900 with Pulpie and $159,000 with Dripper.

Pulpie makes high-quality web extraction affordable at a scale that was previously impractical. We expect this to benefit both pre-training and context management.

Our models are open source and available on Hugging Face. See Get started for instructions.

Extraction is the bottleneck

Language models consume the web twice. First in pre-training, where they learn about the world. Then at inference, when they pull in relevant context. Both times the input is mostly noise. During discovery, we found 70% of the blocks on a typical HTML page hold boilerplate like navigation, ads, sidebars, and footers. Main content is only a small fraction of the page.

However, that fraction determines model quality on both ends.

AICC (Ma et al., 2025) measured the effect of cleaner extraction on pre-training. The team built two corpora from the same Common Crawl snapshot. One extracted content with heuristics. The other extracted it with a model-based parser. Everything else in the data pipeline remained equal. They then trained an identical model on each corpus.

The model trained on the model-extracted corpus scored 1.08 percentage points higher in average accuracy across 13 benchmarks. Since only extraction logic changed, we can attribute the gain entirely to having cleaner data.

Impressively, the same model also beat models trained on FineWeb and RefinedWeb, two of the most heavily filtered pre-training corpora. These datasets have earned their reputations through elaborate filtering and deduplication. Beating them by improving the extractor illustrates the high value of clean data.

Beyond setting a low baseline, poor extraction materially harms models. Heuristics break structured content. The table below shows how Trafilatura and model-based extractors compare on preserving code blocks and formulas. Low similarity scores indicate corruption. If the corrupted text is used in training, the resulting models inherit this damage.

| Content | Trafilatura (heuristic) | Model-based |
| --- | --- | --- |
| Code blocks | 0.13 | 0.91 |
| Formulas | 0.61 | 0.94 |

Data quality matters at inference too. Shi et al. (ICML 2023) showed that a single irrelevant passage is enough to derail a model's answer. A model is more accurate and more efficient when its context is free of noise.

Cleaning on a budget

Cleaning the web pays off in both training and inference. The open question is how to clean well at scale.

First, to understand the landscape, we can divide current extractors into two families based on the question: Does the method read the page, or inspect its structure?

Structure-based extractors judge an HTML block by surface signals. They apply rules over tags, DOM, and text density to separate content from boilerplate. Trafilatura, Readability, and magic-html work this way. Boilerpipe goes one step further and trains a classifier on those same signals. These extractors are easy to run but they confuse similarly built elements. A navigation table and a data table look identical to an algorithm counting cells.
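
To make the table example concrete, here is a toy sketch of a structure-only signal. It is not how Trafilatura, Readability, or Boilerpipe actually score blocks; the `CellCounter` class and both HTML snippets are invented for illustration. It only shows that a rule counting cells sees the same value for a navigation table and a data table.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Counts table cells and ignores everything else, including the text."""

    def __init__(self) -> None:
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.cells += 1


def count_cells(html: str) -> int:
    """Return the number of <td>/<th> start tags found in the HTML string."""
    parser = CellCounter()
    parser.feed(html)
    parser.close()
    return parser.cells


# Hypothetical blocks: one is navigation, the other is real data.
nav_table = "<table><tr><td><a href='/'>Home</a></td><td><a href='/about'>About</a></td></tr></table>"
data_table = "<table><tr><td>Latency</td><td>12 ms</td></tr></table>"

# Both tables produce the same structural signal.
print(count_cells(nav_table), count_cells(data_table))
```

Telling the two apart requires looking at what the cells say, which is what reading extractors do.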

Reading extractors feed the page to a transformer and label each block based on its content. Dripper is a decoder built on this idea. The decoder emits labels one token at a time. Each label forces the full model to be read from memory for a single step of work. This ties speed to memory bandwidth and makes runs expensive.

Pulpie keeps the reading approach but moves the bottleneck to compute. We do this by using an encoder architecture that labels every block in a single forward pass. This enables Pulpie to match Dripper's quality while being smaller, faster, and cheaper.

Depulping raw HTML

The full pipeline runs in four stages:

  1. Simplify the HTML. Remove scripts, styles, and other formatting noise. Tag each block with a unique ID.
  2. Chunk the blocks. Split the blocks, tokenize them, and pack them into chunks of at most 8,192 tokens, so each chunk fits the model in one pass. About 80% of pages fit in a single chunk.
  3. Classify. Run a forward pass. Pulpie labels each block as content or boilerplate.
  4. Return. Return the kept blocks as HTML, or convert them to Markdown.
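
The chunking stage above can be sketched as a greedy packer. This is a minimal illustration under stated assumptions, not Pulpie's actual implementation: `pack_blocks` is a hypothetical helper, the token counts are made up, and the handling of a single block longer than the budget is a placeholder.

```python
from typing import List

MAX_TOKENS = 8192  # chunk budget stated in the article


def pack_blocks(block_token_counts: List[int], max_tokens: int = MAX_TOKENS) -> List[List[int]]:
    """Greedily pack block indices, in page order, into chunks of at most max_tokens.

    Assumption: a block that alone exceeds max_tokens still gets its own chunk;
    a real pipeline would have to split or truncate it.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n_tokens in enumerate(block_token_counts):
        # Start a new chunk when adding this block would overflow the budget.
        if current and used + n_tokens > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_tokens
    if current:
        chunks.append(current)
    return chunks


# Hypothetical per-block token counts for one long page.
counts = [3000, 4000, 2000, 500, 8000]
print(pack_blocks(counts))
```

Because every chunk is classified independently, a long page produces more chunks rather than an overflow.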

Training

Training Pulpie needed a large set of HTML pages with block-level labels. No such public set existed, so we built one.

We sampled 16,670 English pages from Common Crawl, limiting to one per domain. We then used MinerU-HTML to split each page into blocks, and labeled each block as content or boilerplate with DeepSeek V3.2. Further filtering removed empty, corrupted, and otherwise unfit pages, leaving 15,880.

We then ran Dripper 0.6B as a second labeler across all 15,880 pages to flag inconsistent labels. Block-level agreement with DeepSeek was 93.3%. We kept the 14,959 pages where the two labelers agreed on at least 70% of blocks, trading some data for a cleaner training set.
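
The agreement filter can be sketched as follows. The data layout, the `llm` and `dripper` keys, and the function names are assumptions made for this example; only the 70% threshold comes from the text.

```python
from typing import Dict, List

AGREEMENT_THRESHOLD = 0.70  # page-level cutoff stated in the article


def block_agreement(labels_a: List[int], labels_b: List[int]) -> float:
    """Fraction of blocks on one page where two labelers assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelers must label the same blocks")
    if not labels_a:
        return 0.0  # assumption: a page without blocks counts as disagreement
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def filter_pages(
    pages: Dict[str, Dict[str, List[int]]],
    threshold: float = AGREEMENT_THRESHOLD,
) -> List[str]:
    """Return ids of pages where the labelers agree on at least `threshold` of blocks."""
    return [
        page_id
        for page_id, labels in pages.items()
        if block_agreement(labels["llm"], labels["dripper"]) >= threshold
    ]


# Toy data: 1 = content, 0 = boilerplate.
pages = {
    "page-a": {"llm": [1, 0, 0, 1, 0], "dripper": [1, 0, 0, 1, 1]},  # 4 of 5 agree
    "page-b": {"llm": [1, 1, 0, 0], "dripper": [0, 0, 0, 1]},        # 1 of 4 agree
}
print(filter_pages(pages))
```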

Teaching a teacher

To create our teacher model, we fine-tuned EuroBERT-2.1B on these 14,959 pages.

| Setting | Value |
| --- | --- |
| Learning rate | 2e-5 |
| Effective batch size | 8 |
| Loss | Class-weighted cross-entropy |
| Hardware | 4x A100 |

Class weights are set inversely to the 28.6% content rate to counter the imbalance.
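
One common way to set such weights is sketched below. The text only says the weights are inverse to the content rate, so the exact normalization here (each class weighted by 1 / (2 x frequency), the "balanced" convention) is an assumption, as are the function names.

```python
import math

CONTENT_RATE = 0.286  # share of blocks labeled content, from the article


def inverse_frequency_weights(content_rate: float) -> dict:
    """Weight each class by 1 / (2 * class frequency).

    Assumption: with this normalization the average weight over the data is 1.
    """
    return {
        "content": 1.0 / (2.0 * content_rate),
        "boilerplate": 1.0 / (2.0 * (1.0 - content_rate)),
    }


def weighted_cross_entropy(p_content: float, label: int, weights: dict) -> float:
    """Cross-entropy for one block, scaled by the weight of its true class.

    p_content is the predicted probability of the content class; label is 1 or 0.
    """
    p_true = p_content if label == 1 else 1.0 - p_content
    weight = weights["content"] if label == 1 else weights["boilerplate"]
    return -weight * math.log(p_true)


weights = inverse_frequency_weights(CONTENT_RATE)
print(weights)
```

Under this scheme a mistake on a content block costs about 2.5 times as much as the same mistake on a boilerplate block, which offsets the roughly 2.5:1 imbalance between the classes.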

The teacher scored 0.873 ROUGE-5 F1 on the WebMainBench English set. At 2.1B parameters it is accurate but expensive to run, so we distilled it into smaller models.

Imparting knowledge

For a better production fit, we distilled the 2.1B teacher into two smaller models:

  • Pulpie Orange Base, a 610M parameter encoder.
  • Pulpie Orange Small, a 210M parameter encoder.

Both students learn from the teacher following Hinton et al. (2015). The teacher's softened output distribution supplies most of the signal through a KL-divergence loss weighted 0.7, with hard-label cross-entropy making up the remaining 0.3, at temperature 2.0. Both train on the same data as the teacher.
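
For a single block, the combined loss can be sketched in plain Python. The 0.7 / 0.3 split and the temperature of 2.0 come from the text. Scaling the soft term by the squared temperature follows the suggestion in Hinton et al. and is an assumption here, since the article does not say whether Pulpie applies it; the function names are also illustrative.

```python
import math
from typing import List

ALPHA = 0.7        # weight on the soft (KL) term, from the article
TEMPERATURE = 2.0  # softmax temperature, from the article


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    alpha: float = ALPHA,
    temperature: float = TEMPERATURE,
) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    soft = (temperature ** 2) * kl  # assumption: T^2 scaling as in Hinton et al.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1.0 - alpha) * hard


# Logits are ordered [content, boilerplate]; the values are made up.
print(distillation_loss([2.0, 0.0], [3.0, -1.0], label=0))
```

When the student's logits equal the teacher's, the KL term vanishes and only the hard-label term remains.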

The distilled models keep almost all of the teacher's quality.

| Model | Parameters | ROUGE-5 F1 | vs. Teacher |
| --- | --- | --- | --- |
| Pulpie Orange Small | 210M | 0.862 | -1.1 F1 points |
| Dripper | 0.6B | 0.864 | -0.9 F1 points |
| Pulpie Orange Base | 610M | 0.863 | -1.0 F1 points |
| Pulpie Orange Large (teacher) | 2.1B | 0.873 | - |

Despite a tenfold cut in size, the 210M model lands 1.1 F1 points below the teacher. Combined with its speed and cost benefits, pulpie-orange-small offers the best quality for its size in the entire family. It is the model we recommend for production use.

Results

Quality

We measure ROUGE-5 F1 on the English subset of WebMainBench (6,647 pages across all difficulty levels). Empty extractions count as zero.
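
For readers unfamiliar with the metric, here is a minimal sketch of an n-gram overlap F1 with n = 5. It is not the WebMainBench scorer: splitting on whitespace is an assumption, and texts shorter than five tokens score zero in this simplified version.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction: str, reference: str, n: int = 5) -> float:
    """F1 over clipped n-gram overlap between an extraction and the reference text."""
    pred = ngrams(prediction.split(), n)  # assumption: whitespace tokenization
    ref = ngrams(reference.split(), n)
    if not pred or not ref:
        return 0.0  # an empty extraction counts as zero, as in the article
    overlap = sum((pred & ref).values())  # each n-gram counts at most as often as in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "a b c d e f g"
print(rouge_n_f1("a b c d e f", reference))
```

Requiring five consecutive matching tokens makes the score sensitive to broken or reordered passages, not just missing words.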

| Method | ROUGE-5 F1 | Empty pages |
| --- | --- | --- |
| magic-html | 0.700 | 384 |
| Trafilatura | 0.619 | 16 |
| Pulpie Orange Small | 0.862 | 45 |
| Dripper | 0.864 | 135 |
| Pulpie Orange Base | 0.863 | 36 |
| Pulpie Orange Large | 0.873 | 21 |

Pulpie Orange Large is the strongest single model at 0.873, ahead of Dripper by 0.9 F1 points. The 210M model effectively ties Dripper (0.862 versus 0.864) at a third the size. Frontier LLMs score higher on this benchmark, near 0.90, which is the quality Pulpie approaches.

Dripper returns nothing on 135 pages. Of these, 130 fail because the page overflows its 32k-token context window. Pulpie packs blocks into 8,192-token chunks, so page length never forces a failure.

Breaking results down by difficulty:

| Method | All | Simple | Mid | Hard |
| --- | --- | --- | --- | --- |
| magic-html | 0.700 | 0.773 | 0.697 | 0.637 |
| Trafilatura | 0.619 | 0.721 | 0.619 | 0.526 |
| Pulpie Orange Small | 0.862 | 0.906 | 0.868 | 0.813 |
| Dripper | 0.864 | 0.913 | 0.865 | 0.817 |
| Pulpie Orange Base | 0.863 | 0.906 | 0.868 | 0.818 |
| Pulpie Orange Large | 0.873 | 0.914 | 0.879 | 0.827 |

Every method loses ground as pages get harder. The heuristics fall fastest, dropping 14 to 20 F1 points from simple to hard, while the encoders give up about 9 F1 points. Dripper's performance range matches the encoders, with a gap of 10 F1 points between simple and hard pages.

Speed

L4 throughput, on 500 real Common Crawl pages:

| Method | Throughput (pages/sec) | Hardware |
| --- | --- | --- |
| Pulpie Orange Small | 13.7 | L4 |
| Dripper | 0.68 | L4 |
| Pulpie Orange Base | 3.9 | L4 |
| Pulpie Orange Large | 1.3 | L4 |

Pulpie Orange Small runs 20x faster than Dripper on the same L4.

A100 throughput, same pages, GPU inference only, batched for every model:

| Method | Throughput (pages/sec) | Hardware |
| --- | --- | --- |
| Pulpie Orange Small | 25.7 | A100 |
| Dripper | 3.6 | A100 |
| Pulpie Orange Base | 7.7 | A100 |
| Pulpie Orange Large | 3.5 | A100 |

On the A100, Pulpie Orange Small runs 7.1x faster than Dripper. The 2.1B teacher matches Dripper on speed while beating it on quality.

Cost

L4 cost for 1 billion pages at $0.39/hr. Calculated using the throughputs measured above:

| Setup | Pages/sec | GPU-hours / 1B pages | Cost / 1B pages |
| --- | --- | --- | --- |
| Pulpie Small on L4 | 13.7 | 20,300 | ~$7,900 |
| Dripper on L4 | 0.68 | 408,000 | ~$159,000 |
| Pulpie Base on L4 | 3.9 | 71,200 | ~$28,000 |
| Pulpie Large on L4 | 1.3 | 214,000 | ~$83,000 |

A100 cost for 1 billion pages at $2.72/hr. Calculated using the throughputs measured above:

| Setup | Pages/sec | GPU-hours / 1B pages | Cost / 1B pages |
| --- | --- | --- | --- |
| Pulpie Small on A100 | 25.7 | 10,800 | ~$29,000 |
| Dripper on A100 | 3.6 | 77,200 | ~$210,000 |
| Pulpie Base on A100 | 7.7 | 36,100 | ~$98,000 |
| Pulpie Large on A100 | 3.5 | 79,400 | ~$216,000 |
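
The cost figures follow directly from throughput and hourly price. The sketch below reproduces the arithmetic so it can be rerun with other prices or throughputs; the helper names are ours, and the inputs are the numbers quoted in the tables.

```python
ONE_BILLION = 1_000_000_000


def gpu_hours(pages: int, pages_per_sec: float) -> float:
    """GPU-hours needed to process `pages` at a steady throughput."""
    return pages / pages_per_sec / 3600


def cost_usd(pages: int, pages_per_sec: float, usd_per_gpu_hour: float) -> float:
    """Total cost at a flat hourly GPU price."""
    return gpu_hours(pages, pages_per_sec) * usd_per_gpu_hour


# (setup, pages/sec, USD per GPU-hour), taken from the tables above.
setups = [
    ("Pulpie Small on L4", 13.7, 0.39),
    ("Dripper on L4", 0.68, 0.39),
    ("Pulpie Small on A100", 25.7, 2.72),
    ("Dripper on A100", 3.6, 2.72),
]

for name, throughput, price in setups:
    hours = gpu_hours(ONE_BILLION, throughput)
    dollars = cost_usd(ONE_BILLION, throughput, price)
    print(f"{name}: {hours:,.0f} GPU-hours, ${dollars:,.0f}")
```

The calculation assumes full utilization and ignores CPU preprocessing, storage, and network costs.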

Cheap GPUs like encoders

The throughput gap between Pulpie and Dripper is much larger than a 3x difference in size would imply. On the A100, we measure this gap as 7.1x, and on the L4 it widens to 20x. The reason for this is architectural.

A decoder generates labels one token at a time. Each step reads the full model from GPU memory to produce a single token. Consequently, a decoder's speed is bound by memory bandwidth. An encoder, by contrast, runs one forward pass over the whole input. This dense matrix multiply is limited mainly by compute.

The A100 and the L4 also differ more in bandwidth than in compute:

| Dimension | NVIDIA A100 | NVIDIA L4 | Ratio (A100/L4) |
| --- | --- | --- | --- |
| Memory Bandwidth | 2,039 GB/s | 300 GB/s | ~6.8x |
| Tensor Core TFLOPS | 312 | 120 | ~2.6x |

Dropping from A100 to L4 starves the bandwidth-bound decoder far more than the compute-bound encoder. This widens the throughput gap and lets Pulpie Orange Large pull ahead on L4 despite matching Dripper on A100.
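
As a rough check, the measured throughputs can be compared with the hardware ratios. The sketch below uses only numbers from the tables in this article. The two benchmarks may not share an identical setup (the A100 run is described as GPU inference only and batched), so the ratios are indicative rather than exact.

```python
# Measured throughput in pages/sec, from the Speed section.
L4 = {"Pulpie Orange Small": 13.7, "Dripper": 0.68, "Pulpie Orange Base": 3.9, "Pulpie Orange Large": 1.3}
A100 = {"Pulpie Orange Small": 25.7, "Dripper": 3.6, "Pulpie Orange Base": 7.7, "Pulpie Orange Large": 3.5}

BANDWIDTH_RATIO = 2039 / 300  # A100 / L4 memory bandwidth
COMPUTE_RATIO = 312 / 120     # A100 / L4 tensor core TFLOPS


def slowdown(model: str) -> float:
    """How many times slower a model runs on the L4 than on the A100."""
    return A100[model] / L4[model]


print(f"bandwidth ratio: {BANDWIDTH_RATIO:.1f}x, compute ratio: {COMPUTE_RATIO:.1f}x")
for model in L4:
    print(f"{model}: {slowdown(model):.1f}x slower on L4")
```

The encoders slow down by about 1.9x to 2.7x, close to the compute ratio, while Dripper slows down by about 5.3x, closer to the bandwidth ratio.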

Get started

The Pulpie models are on Hugging Face. Install the package:

pip install pulpie

Extract clean content from raw HTML:

from pulpie import Extractor

extractor = Extractor()               # defaults to Pulpie Orange Small
result = extractor.extract(html)      # html: the raw HTML of one page

print(result.markdown)                # clean markdown
print(result.n_main, result.n_other)  # blocks kept vs dropped

For maximum quality over speed, pick a larger model:

extractor = Extractor(model="large")  # "small" (default), "base", or "large"

For bulk processing, the pipeline overlaps CPU preprocessing with GPU inference across one or more GPUs:

from pulpie import Pipeline, PageInput

# pages: a list of raw HTML strings, one per page
pipeline = Pipeline(model="small")
results = pipeline.extract_batch(
    [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
)

All three models are built on EuroBERT (Boizard et al., 2025), use the same <|sep|> block-marker architecture, and share a tokenizer:

| Name | Hugging Face | Parameters | ROUGE-5 F1 | Notes |
| --- | --- | --- | --- | --- |
| Orange Small | feyninc/pulpie-orange-small-v1 | 210M | 0.862 | Recommended |
| Orange Base | feyninc/pulpie-orange-base-v1 | 610M | 0.863 | Distilled from Large |
| Orange Large | feyninc/pulpie-orange-large-v1 | 2.1B | 0.873 | Teacher |

Pulpie Orange Small is the recommended and default model. It approaches SOTA extraction quality at one twentieth the cost and runs the fastest.

Pulpie is built by Feyn. Find us on GitHub, Hugging Face, or X.

Acknowledgements

Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We're grateful they released their tools and data.

References

[1] Ma et al. "AICC: Parse HTML Finer, Make Models Better - A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser." arXiv:2511.16397 (2025).

[2] Boizard et al. "EuroBERT: Scaling Multilingual Encoders for European Languages." arXiv:2503.05500 (2025).

[3] Hinton et al. "Distilling the Knowledge in a Neural Network." arXiv:1503.02531 (2015).

[4] Raffel et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR 2020.

[5] Penedo et al. "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only." NeurIPS 2023.

[6] Penedo et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557 (2024).

[7] Li et al. "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models." NeurIPS 2024.

[8] Soldaini et al. "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." ACL 2024.

[9] Barbaresi. "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction." ACL/IJCNLP 2021.

[10] Kohlschütter et al. "Boilerplate Detection using Shallow Text Features." WSDM 2010.

[11] Pomikálek. "Removing Boilerplate and Duplicate Content from Web Corpora." PhD thesis, Masaryk University, 2011.

[12] Bevendorff et al. "An Empirical Comparison of Web Content Extraction Algorithms." SIGIR 2023.

[13] Shi et al. "Large Language Models Can Be Easily Distracted by Irrelevant Context." ICML 2023.
