SPADE v2 - Document Field Extraction Model
Extractive QA model for structured data extraction from documents (receipts, invoices, etc.)
Model Details
- Architecture: ModernBERT-large + span prediction heads
- Parameters: 395M
- Max Length: 4096 tokens
- Task: SQuAD 2.0-style extractive QA (answer span prediction, with a "no answer" option)
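To make the task description concrete: in the standard SQuAD 2.0 recipe, the model scores every token as a possible answer start and answer end, and the prediction is the best-scoring valid span unless a "no answer" score beats it. The sketch below shows that generic rule on plain Python lists. It is not the decoding implemented in model_v2.py; the function name `best_span`, the `max_answer_len` parameter, and the convention that index 0 (the [CLS] position) carries the no-answer score are assumptions taken from the usual recipe.

```python
from typing import List, Optional, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the best span, or None for "no answer".

    Assumption: index 0 is the [CLS] token, and its start + end score acts as
    the "no answer" score, as in the usual SQuAD 2.0 setup.
    """
    null_score = start_logits[0] + end_logits[0]
    best_score = float("-inf")
    best: Optional[Tuple[int, int]] = None

    # Consider every span with start <= end and a bounded length, skipping [CLS].
    for start in range(1, len(start_logits)):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)

    # Predict "no answer" when the null score is at least as good as any span.
    if best is None or null_score >= best_score:
        return None
    return best
```

A real pipeline would typically also restrict candidate spans to document tokens, so that text from the query is never returned as an answer.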
Performance
On the validation set (scalar fields):
- Text Match: 77.86%
- Exact Match: 77.38%
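This card does not define the two metrics. One plausible reading, consistent with Text Match being slightly higher, is that Exact Match compares raw strings while Text Match compares them after light normalization. The sketch below illustrates that reading only. The `normalize` helper and its rules (lowercasing, collapsing whitespace) are assumptions, not the evaluation code behind the numbers above.

```python
from typing import List


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match(preds: List[str], golds: List[str]) -> float:
    """Percentage of predictions identical to the gold string."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    hits = sum(p == g for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)


def text_match(preds: List[str], golds: List[str]) -> float:
    """Percentage of predictions equal to the gold string after normalization."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)
```

Under this reading, a prediction such as "Acme  Corp" against a gold value of "acme corp" would count toward Text Match but not Exact Match.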
Usage
import torch
from transformers import AutoTokenizer
from model_v2 import SpadeExtractor  # Custom model class (defined in model_v2.py)

tokenizer = AutoTokenizer.from_pretrained("bluecopa/smalldocs-spade")
model = SpadeExtractor.from_pretrained("bluecopa/smalldocs-spade")

# Format: query + document
query = "total: Total amount due"
document = "Receipt\nItem: Widget $50\nTotal: $50"
inputs = tokenizer(query, document, return_tensors="pt", max_length=4096, truncation=True)

# Run the forward pass without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Decode prediction
# ... see model_v2.py for decode_with_confidence()
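The example query appears to follow a `name: description` pattern. Assuming that pattern holds for other fields, a schema of several fields can be turned into one query per field, each paired with the same document. In the sketch below, `FIELDS`, `build_query`, `build_queries`, and every field description except the one for `total` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical schema: only the "total" entry comes from the usage example.
FIELDS: Dict[str, str] = {
    "total": "Total amount due",
    "merchant_name": "Name of the merchant",
    "date": "Transaction date",
}


def build_query(name: str, description: str) -> str:
    """Format one field as a query string in the assumed "name: description" form."""
    return f"{name}: {description}"


def build_queries(fields: Dict[str, str]) -> List[str]:
    """Build one query per field, preserving the schema's insertion order."""
    return [build_query(name, desc) for name, desc in fields.items()]


queries = build_queries(FIELDS)
for q in queries:
    print(q)
```

Each query can then be tokenized with the document exactly as in the usage example, one forward pass per field. Hugging Face tokenizers also accept a list of queries and a list of documents of the same length, which would allow batching if padding is enabled.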
Training
Trained on the bluecopa/smalldocs-jsonextract-clean dataset.
Limitations
- Best for scalar fields (merchant name, total, date, etc.)
- For table/array fields (line items), use a separate table extraction model
- Some fields, such as tip, discount, and addresses, have lower accuracy
License
Apache 2.0
Base Model
answerdotai/ModernBERT-large