Initial release.

37ce69f almost 2 years ago

4.6 kB

	---
	license: cc-by-sa-3.0
	language: ja
	inference: false
	---

	# LayoutLM-wikipedia-ja Model

	This is a [LayoutLM](https://doi.org/10.1145/3394486.3403172) model pretrained on texts in the Japanese language.

	## Model Details

	### Model Description

	- Developed by: Advanced Technology Laboratory, The Japan Research Institute, Limited.
	- Model type: LayoutLM
	- Language: Japanese
	- License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)
	- Finetuned from model: [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2)

	## Uses

	The model is primarily aimed at being fine-tuned on a token classification task. You can use the raw model for masked language modeling, although it is not the primary use case. Refer to <https://github.com/nishiwakikazutaka/shinra2022-task2_jrird> for instructions on how to fine-tune the model. Note that the linked repository is written in Japanese.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	>>> from transformers import AutoTokenizer, AutoModel
	>>> import torch

	>>> tokenizer = AutoTokenizer.from_pretrained("jri-advtechlab/layoutlm-wikipedia-ja")
	>>> model = AutoModel.from_pretrained("jri-advtechlab/layoutlm-wikipedia-ja")

	>>> tokens = tokenizer.tokenize("こんにちは") # ['こん', '##にち', '##は']
	>>> normalized_token_boxes = [[637, 773, 693, 782], [693, 773, 749, 782], [749, 773, 775, 782]]
	>>> # add bounding boxes of cls + sep tokens
	>>> bbox = [[0, 0, 0, 0]] + normalized_token_boxes + [[1000, 1000, 1000, 1000]]

	>>> input_ids = [tokenizer.cls_token_id] \
	+ tokenizer.convert_tokens_to_ids(tokens) \
	+ [tokenizer.sep_token_id]
	>>> attention_mask = [1] * len(input_ids)
	>>> token_type_ids = [0] * len(input_ids)
	>>> encoding = {
	"input_ids": torch.tensor([input_ids]),
	"attention_mask": torch.tensor([attention_mask]),
	"token_type_ids": torch.tensor([token_type_ids]),
	"bbox": torch.tensor([bbox]),
	}

	>>> outputs = model(**encoding)
	```

	## Training Details

	### Training Data

	The model is trained on the Japanese version of Wikipedia. The training corpus is distributed as [training data of the SHINRA 2022 shared task](https://2022.shinra-project.info/data-download#subtask-common).

	### Tokenization and Localization

	We used the tokenizer of [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) to split texts into tokens (subwords). Each token is wrapped in a `<span>` tag with the no-wrap value set for the white-space property and localized by obtaining `BoundingClientRect`. The localization process was conducted with Google Chrome (106.0.5249.119) headless mode on Ubuntu 20.04.5 LTS with a 1,280*854 window size.

	The vocabulary is the same as [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2).

	### Training Procedure

	The model was trained using Masked Visual-Language Model (MVLM), but it was not trained using Multi-label Document Classification (MDC). We made this decision because we did not identify significant visual differences, such as those between a contract and an invoice, between the different Wikipedia articles.

	#### Preprocessing

	All parameters except the 2-D Position Embedding were initialized with weights from [cl-tohoku/bert-base-japanese-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-v2). We initialized the 2-D Position Embedding with random values.

	#### Training Hyperparameters

	The model was trained on 8 NVIDIA A100 SXM4 GPUs for 100,000 steps, with a batch size of 256 with a maximum sequence length of 512. The optimizer used is Adam with a learning rate of 5e-5, β<sub>1</sub>=0.9, β<sub>2</sub>=0.999, learning rate warmup for 1,000 steps, and linear decay of the learning rate after. Additionally, we utilized fp16 mixed precision during training. The training took about 5.3 hours to finish.

	## Evaluation

	Our fine-tuned model achieved a macro-f1 score of 55.1451 on the leaderboard for the SHINRA 2022 shared task. You can check the leaderboard at [https://2022.shinra-project.info/#leaderboard](https://2022.shinra-project.info/#leaderboard) for detailed information.

	## Citation

	BibTeX:

	```tex
	@inproceedings{nishiwaki2023layoutlm-wiki-ja,
	title = {日本語情報抽出タスクのための{L}ayout{LM}モデルの評価},
	author = {西脇一尊 and 大沼俊輔 and 門脇一真},
	booktitle = {言語処理学会第29回年次大会(NLP2023)予稿集},
	year = {2023},
	pages = {522--527}
	}
	```