SPADE v2 - Document Field Extraction Model

An extractive QA model that pulls structured fields out of documents such as receipts and invoices.

Model Details

  • Architecture: ModernBERT-large + span prediction heads
  • Parameters: 395M
  • Max Length: 4096 tokens
  • Task: SQuAD 2.0 style extractive QA
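As a rough illustration of what SQuAD 2.0 style span prediction means, the sketch below picks the highest-scoring (start, end) token pair and returns no answer when the null score wins. It is not the model's own decoder (that lives in model_v2.py). The function name best_span, the max_answer_len cap, and the use of position 0 as the null ([CLS]) position are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,  # hypothetical cap on span length, in tokens
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) token span with the highest summed logit,
    or None when the null (no-answer) score is at least as high.

    Assumes position 0 is the [CLS] token and represents "no answer".
    A real decoder would also restrict spans to document tokens.
    """
    null_score = start_logits[0] + end_logits[0]
    best_score = float("-inf")
    best: Optional[Tuple[int, int]] = None
    for start in range(1, len(start_logits)):
        # Only consider ends at or after the start, within the length cap.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best if best_score > null_score else None
```

Returning None corresponds to the field being absent from the document, which is the case SQuAD 2.0 adds over SQuAD 1.1.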

Performance

On the validation set (scalar fields):

  • Text Match: 77.86%
  • Exact Match: 77.38%
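The card does not define these two metrics. One plausible reading, shown here purely as an assumption, is that Exact Match is strict string equality while Text Match compares strings after light normalization (case and whitespace). The helper names normalize and match_rates are hypothetical.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def match_rates(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Compute strict and normalized match rates as percentages."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    pairs = list(zip(predictions, references))
    exact = sum(p == r for p, r in pairs)
    text = sum(normalize(p) == normalize(r) for p, r in pairs)
    n = len(pairs)
    return {"exact_match": 100.0 * exact / n, "text_match": 100.0 * text / n}
```

Under this reading, Text Match can never be lower than Exact Match, which is consistent with the figures reported above.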

Usage

import torch
from transformers import AutoTokenizer
from model_v2 import SpadeExtractor  # Custom model class

tokenizer = AutoTokenizer.from_pretrained("bluecopa/smalldocs-spade")
model = SpadeExtractor.from_pretrained("bluecopa/smalldocs-spade")

# Format: query + document, passed to the tokenizer as a sequence pair
query = "total: Total amount due"
document = "Receipt\nItem: Widget $50\nTotal: $50"

# Inputs longer than 4096 tokens are truncated
inputs = tokenizer(query, document, return_tensors="pt", max_length=4096, truncation=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Decode prediction
# ... see model_v2.py for decode_with_confidence()
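The query above appears to follow a "field name: description" pattern. Assuming that pattern holds for other fields, a small helper can build one query per field. The helper name build_queries and the example field descriptions are hypothetical and not part of the model repository.

```python
from typing import Dict, List, Tuple


def build_queries(fields: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn {field_name: description} into (field_name, query) pairs,
    using the assumed "name: description" query format."""
    return [(name, f"{name}: {description}") for name, description in fields.items()]


# Hypothetical field set for a receipt
receipt_fields = {
    "merchant_name": "Name of the merchant",
    "total": "Total amount due",
    "date": "Transaction date",
}

for name, query in build_queries(receipt_fields):
    print(name, "->", query)
```

Each generated query would then be paired with the same document text and passed through the tokenizer and model, one forward pass per field.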

Training

Trained on the bluecopa/smalldocs-jsonextract-clean dataset.

Limitations

  • Best for scalar fields (merchant name, total, date, etc.)
  • For table/array fields (line items), a separate table extraction model is recommended
  • Some fields, such as tip, discount, and addresses, have lower accuracy
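One way to act on the first two limitations is to split the target fields by type before extraction, sending only scalar fields to this model. The sketch below assumes a JSON-Schema-like description in which each field carries a "type" key. That schema shape and the function name split_fields are assumptions, not something the card specifies.

```python
from typing import Dict, List, Tuple


def split_fields(schema: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Split field names into (scalar, table) lists.

    Fields typed "array" or "object" are treated as table-like and are
    meant for a separate table extraction model. Everything else is scalar.
    """
    scalar: List[str] = []
    table: List[str] = []
    for name, spec in schema.items():
        if spec.get("type") in ("array", "object"):
            table.append(name)
        else:
            scalar.append(name)
    return scalar, table


# Hypothetical receipt schema
schema = {
    "merchant_name": {"type": "string"},
    "total": {"type": "number"},
    "line_items": {"type": "array"},
}
scalar_fields, table_fields = split_fields(schema)
print(scalar_fields, table_fields)
```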

License

Apache 2.0
