|
|
--- |
|
|
license: cc-by-sa-3.0 |
|
|
language: ja |
|
|
inference: false |
|
|
--- |
|
|
|
|
|
# LayoutLM-wikipedia-ja Model |
|
|
|
|
|
This is a [LayoutLM](https://doi.org/10.1145/3394486.3403172) model pretrained on texts in the Japanese language. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** Advanced Technology Laboratory, The Japan Research Institute, Limited. |
|
|
- **Model type:** LayoutLM |
|
|
- **Language:** Japanese |
|
|
- **License:** [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
|
|
- **Finetuned from model:** [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) |
|
|
|
|
|
## Uses |
|
|
|
|
|
The model is primarily aimed at being fine-tuned on a token classification task. You can use the raw model for masked language modeling, although it is not the primary use case. Refer to <https://github.com/nishiwakikazutaka/shinra2022-task2_jrird> for instructions on how to fine-tune the model. Note that the linked repository is written in Japanese. |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Use the code below to get started with the model. |
|
|
|
|
|
```python |
|
|
>>> from transformers import AutoTokenizer, AutoModel |
|
|
>>> import torch |
|
|
|
|
|
>>> tokenizer = AutoTokenizer.from_pretrained("jri-advtechlab/layoutlm-wikipedia-ja") |
|
|
>>> model = AutoModel.from_pretrained("jri-advtechlab/layoutlm-wikipedia-ja") |
|
|
|
|
|
>>> tokens = tokenizer.tokenize("こんにちは") # ['こん', '##にち', '##は'] |
|
|
>>> normalized_token_boxes = [[637, 773, 693, 782], [693, 773, 749, 782], [749, 773, 775, 782]] |
|
|
>>> # add bounding boxes of cls + sep tokens |
|
|
>>> bbox = [[0, 0, 0, 0]] + normalized_token_boxes + [[1000, 1000, 1000, 1000]] |
|
|
|
|
|
>>> input_ids = [tokenizer.cls_token_id] \ |
|
|
+ tokenizer.convert_tokens_to_ids(tokens) \ |
|
|
+ [tokenizer.sep_token_id] |
|
|
>>> attention_mask = [1] * len(input_ids) |
|
|
>>> token_type_ids = [0] * len(input_ids) |
|
|
>>> encoding = { |
|
|
"input_ids": torch.tensor([input_ids]), |
|
|
"attention_mask": torch.tensor([attention_mask]), |
|
|
"token_type_ids": torch.tensor([token_type_ids]), |
|
|
"bbox": torch.tensor([bbox]), |
|
|
} |
|
|
|
|
|
>>> outputs = model(**encoding) |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model is trained on the Japanese version of Wikipedia. The training corpus is distributed as [training data of the SHINRA 2022 shared task](https://2022.shinra-project.info/data-download#subtask-common). |
|
|
|
|
|
### Tokenization and Localization |
|
|
|
|
|
We used the tokenizer of [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) to split texts into tokens (subwords). Each token is wrapped in a `<span>` tag with the no-wrap value set for the white-space property and localized by obtaining `BoundingClientRect`. The localization process was conducted with Google Chrome (106.0.5249.119) headless mode on Ubuntu 20.04.5 LTS with a 1,280*854 window size. |
|
|
|
|
|
The vocabulary is the same as [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2). |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
The model was trained using Masked Visual-Language Model (MVLM), but it was not trained using Multi-label Document Classification (MDC). We made this decision because we did not identify significant visual differences, such as those between a contract and an invoice, between the different Wikipedia articles. |
|
|
|
|
|
#### Preprocessing |
|
|
|
|
|
All parameters except the 2-D Position Embedding were initialized with weights from [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2). We initialized the 2-D Position Embedding with random values. |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
The model was trained on 8 NVIDIA A100 SXM4 GPUs for 100,000 steps, with a batch size of 256 with a maximum sequence length of 512. The optimizer used is Adam with a learning rate of 5e-5, β<sub>1</sub>=0.9, β<sub>2</sub>=0.999, learning rate warmup for 1,000 steps, and linear decay of the learning rate after. Additionally, we utilized fp16 mixed precision during training. The training took about 5.3 hours to finish. |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Our fine-tuned model achieved a macro-f1 score of 55.1451 on the leaderboard for the SHINRA 2022 shared task. You can check the leaderboard at [https://2022.shinra-project.info/#leaderboard](https://2022.shinra-project.info/#leaderboard) for detailed information. |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
```tex |
|
|
@inproceedings{nishiwaki2023layoutlm-wiki-ja, |
|
|
title = {日本語情報抽出タスクのための{L}ayout{LM}モデルの評価}, |
|
|
author = {西脇一尊 and 大沼俊輔 and 門脇一真}, |
|
|
booktitle = {言語処理学会第29回年次大会(NLP2023)予稿集}, |
|
|
year = {2023}, |
|
|
pages = {522--527} |
|
|
} |
|
|
``` |
|
|
|