Model Card for Model ID

Predicts the text quality (BLEU) that various document parsers would achieve on a document, given the PyMuPDF-extracted text of its first page. This model belongs to the second version of AdaParse ("AdaParse v2").

Model Details

Allen AI's SPECTER, fine-tuned for document quality prediction.

Model Description

  • Developed by: Carlo Siebenschuh

Model Sources

Uses

Predict the quality of parser output given the extracted text.

Direct Use

Document quality prediction for resource-optimal delegation within AdaParse (this checkpoint belongs to version 2).
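To illustrate what delegation based on predicted quality could look like, here is a minimal sketch. It is not AdaParse's actual routing policy: the `ParserOption` class, the `choose_parser` function, the `min_gain` threshold, the cost values, and the parser names are all hypothetical and chosen only for illustration. The idea shown is simply to stay with the cheapest parser unless a costlier one is predicted to improve BLEU by a meaningful margin.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ParserOption:
    name: str              # hypothetical parser label
    predicted_bleu: float  # quality predicted by the model for this parser
    cost: float            # hypothetical relative compute cost per document


def choose_parser(options: Sequence[ParserOption], min_gain: float = 0.05) -> ParserOption:
    """Keep the cheapest parser unless the best-predicted one gains >= min_gain BLEU."""
    if not options:
        raise ValueError("need at least one parser option")
    cheapest = min(options, key=lambda o: o.cost)
    best = max(options, key=lambda o: o.predicted_bleu)
    # Upgrade only when the predicted improvement justifies the extra cost.
    if best.predicted_bleu - cheapest.predicted_bleu >= min_gain:
        return best
    return cheapest


# Illustrative values only; these are not measured results.
easy_doc = [ParserOption("pymupdf", 0.62, 1.0), ParserOption("ocr_parser", 0.64, 40.0)]
hard_doc = [ParserOption("pymupdf", 0.30, 1.0), ParserOption("ocr_parser", 0.80, 40.0)]
print(choose_parser(easy_doc).name)
print(choose_parser(hard_doc).name)
```

A real deployment would likely replace the fixed threshold with a budget over the whole corpus; the threshold is used here only because it keeps the sketch short.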

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

Quality prediction (a) for documents that are out-of-distribution (e.g., non-scientific) or (b) for parsers that were not part of the fine-tuning set.

Bias, Risks, and Limitations

Bias: The model was trained on tens of thousands of scientific documents from several journals across eight scientific disciplines (mathematics, engineering, biology, physics, etc.). It is therefore biased towards STEM documents.

Limitations: Predicting quality from a single page's text, extracted by one particular tool (PyMuPDF), is challenging.

Recommendations

Fine-tune further on your document corpus.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
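A minimal sketch of how inference might look, under several assumptions that the card does not confirm: `MODEL_ID` is a placeholder for this checkpoint's Hub repository ID, the checkpoint is assumed to load through `AutoModelForSequenceClassification` with one regression output per parser, and the parser names passed to `label_scores` (a hypothetical helper) are assumed to match the output order used during fine-tuning. The 512-token limit reflects SPECTER's BERT-style encoder.

```python
from typing import Dict, List, Sequence

# Placeholder, not a real repository ID: replace with this checkpoint's Hub ID.
MODEL_ID = "<org>/<model-id>"


def first_page_text(pdf_path: str) -> str:
    """Extract the first page's text with PyMuPDF, the input the model expects."""
    import fitz  # PyMuPDF; imported lazily because it is an optional dependency

    with fitz.open(pdf_path) as doc:
        return doc[0].get_text()


def predict_scores(text: str, model_id: str = MODEL_ID) -> List[float]:
    """Return raw model outputs for one text.

    ASSUMPTION: the checkpoint exposes a sequence-classification-style head
    whose outputs are the predicted BLEU scores, one per parser.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, n_outputs)
    return logits.squeeze(0).tolist()


def label_scores(scores: Sequence[float], parser_names: Sequence[str]) -> Dict[str, float]:
    """Pair each output with a parser name (the order is an assumption)."""
    if len(scores) != len(parser_names):
        raise ValueError("number of scores and parser names must match")
    return dict(zip(parser_names, scores))


# Example usage (requires PyMuPDF, torch, transformers, and a valid MODEL_ID):
# text = first_page_text("paper.pdf")
# scores = predict_scores(text)
# print(label_scores(scores, ["parser_a", "parser_b"]))
```

The output order and the set of supported parsers should be checked against the AdaParse v2 training configuration before relying on the labels.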

Training Details

Training Data

~30K documents. Not public.

Training Procedure

Internal software:

run_training.py ... --parser pymupdf --max_page_idx 0 --task reg --alpha 0.5 --multi --batch_size 64 --n_epochs 12 --learn_rate 3e-5

Preprocessing [optional]

None

Training Hyperparameters

  • Training regime: fp32

Speeds, Sizes, Times [optional]

None

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

High-performance compute (Aurora, Polaris, Sophia, Lambda) at Argonne National Laboratory (ANL)/Argonne Leadership Computing Facility (ALCF).

Citation

BibTeX:

@article{siebenschuh2025adaparse,
  title={AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine},
  author={Siebenschuh, Carlo and Hippe, Kyle and Gokdemir, Ozan and Brace, Alexander and Khan, Arham and Hossain, Khalid and Babuji, Yadu and Chia, Nicholas and Vishwanath, Venkatram and Stevens, Rick and others},
  journal={arXiv preprint arXiv:2505.01435},
  year={2025}
}

APA:

Siebenschuh, C., Hippe, K., Gokdemir, O., Brace, A., Khan, A., Hossain, K., ... & Underwood, R. (2025). AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine. arXiv preprint arXiv:2505.01435.

Model Card Authors [optional]

Carlo Siebenschuh

Model Card Contact

7shoe

Model size: 0.1B params (Safetensors, tensor type F32)