---
library_name: transformers
pipeline_tag: table-question-answering
license: mit
datasets:
- ethanbradley/synfintabs
language:
- en
base_model:
- microsoft/layoutlm-base-uncased
---

# FinTabQA: Financial Table Question-Answering

A model for financial table question-answering using the [LayoutLM](https://huggingface.co/microsoft/layoutlm-base-uncased) architecture.

## Quick start

To get started with FinTabQA, load the model and a fast tokenizer as you would any other Hugging Face Transformers model and tokenizer. Below is a minimal working example using the [SynFinTabs](https://huggingface.co/datasets/ethanbradley/synfintabs) dataset.

```python
>>> from typing import List, Tuple
>>> from datasets import load_dataset
>>> from transformers import LayoutLMForQuestionAnswering, LayoutLMTokenizerFast
>>> import torch
>>> 
>>> synfintabs_dataset = load_dataset("ethanbradley/synfintabs")
>>> model = LayoutLMForQuestionAnswering.from_pretrained("ethanbradley/fintabqa")
>>> tokenizer = LayoutLMTokenizerFast.from_pretrained(
...     "microsoft/layoutlm-base-uncased")
>>> 
>>> def normalise_boxes(
...         boxes: List[List[int]],
...         old_image_size: Tuple[int, int],
...         new_image_size: Tuple[int, int]) -> List[List[int]]:
...     old_im_w, old_im_h = old_image_size
...     new_im_w, new_im_h = new_image_size
... 
...     return [[
...         max(min(int(x1 / old_im_w * new_im_w), new_im_w), 0),
...         max(min(int(y1 / old_im_h * new_im_h), new_im_h), 0),
...         max(min(int(x2 / old_im_w * new_im_w), new_im_w), 0),
...         max(min(int(y2 / old_im_h * new_im_h), new_im_h), 0)
...     ] for (x1, y1, x2, y2) in boxes]
>>> 
>>> item = synfintabs_dataset['test'][0]
>>> question_dict = next(question for question in item['questions']
...     if question['id'] == item['question_id'])
>>> encoding = tokenizer(
...     question_dict['question'].split(),
...     item['ocr_results']['words'],
...     max_length=512,
...     padding="max_length",
...     truncation="only_second",
...     is_split_into_words=True,
...     return_token_type_ids=True,
...     return_tensors="pt")
>>> 
>>> word_boxes = normalise_boxes(
...     item['ocr_results']['bboxes'],
...     item['image'].crop(item['bbox']).size,
...     (1000, 1000))
>>> token_boxes = []
>>> 
>>> for i, s, w in zip(
...         encoding['input_ids'][0],
...         encoding.sequence_ids(0),
...         encoding.word_ids(0)):
...     if s == 1:
...         token_boxes.append(word_boxes[w])
...     elif i == tokenizer.sep_token_id:
...         token_boxes.append([1000] * 4)
...     else:
...         token_boxes.append([0] * 4)
>>> 
>>> encoding['bbox'] = torch.tensor([token_boxes])
>>> outputs = model(**encoding)
>>> start = encoding.word_ids(0)[outputs['start_logits'].argmax(-1)]
>>> end = encoding.word_ids(0)[outputs['end_logits'].argmax(-1)]
>>> 
>>> print(f"Target: {question_dict['answer']}")
Target: 6,980
>>> 
>>> print(f"Prediction: {' '.join(item['ocr_results']['words'][start : end + 1])}")
Prediction: 6,980
```
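LayoutLM expects every bounding box to be expressed in a 0–1000 coordinate space regardless of the source image's dimensions, which is what the `normalise_boxes` helper in the example handles. As a sanity check, the same rescale-and-clamp logic can be exercised on its own (the sizes and boxes below are illustrative, not from the dataset):

```python
from typing import List, Tuple

def normalise_boxes(
        boxes: List[List[int]],
        old_image_size: Tuple[int, int],
        new_image_size: Tuple[int, int]) -> List[List[int]]:
    # Rescale each (x1, y1, x2, y2) box from the old image size to the
    # new one, clamping every coordinate into [0, new dimension].
    old_im_w, old_im_h = old_image_size
    new_im_w, new_im_h = new_image_size

    return [[
        max(min(int(x1 / old_im_w * new_im_w), new_im_w), 0),
        max(min(int(y1 / old_im_h * new_im_h), new_im_h), 0),
        max(min(int(x2 / old_im_w * new_im_w), new_im_w), 0),
        max(min(int(y2 / old_im_h * new_im_h), new_im_h), 0)
    ] for (x1, y1, x2, y2) in boxes]

# A box in a 500x200 crop, rescaled into LayoutLM's 1000x1000 space.
print(normalise_boxes([[50, 20, 250, 100]], (500, 200), (1000, 1000)))
# → [[100, 100, 500, 500]]
```

Boxes that fall partially outside the crop are clamped rather than dropped, so out-of-range OCR coordinates cannot produce values LayoutLM's position embeddings would reject.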

## Citation

If you use this model, please cite both the article (using the citation below) and the model itself.

```bibtex
@inproceedings{bradley2026synfintabs,
    title        = {Syn{F}in{T}abs: A Dataset of Synthetic Financial Tables for Information and Table Extraction},
    author       = {Bradley, Ethan and Roman, Muhammad and Rafferty, Karen and Devereux, Barry},
    year         = 2026,
    month        = jan,
    booktitle    = {Document Analysis and Recognition -- ICDAR 2025 Workshops},
    publisher    = {Springer Nature Switzerland},
    address      = {Cham},
    pages        = {85--100},
    doi          = {10.1007/978-3-032-09371-4_6},
    isbn         = {978-3-032-09371-4},
    editor       = {Jin, Lianwen and Zanibbi, Richard and Eglin, Veronique},
    abstract     = {Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available (https://ethanbradley.co.uk/research/synfintabs).}
}
```