# LayoutReader

LayoutReader captures text and layout information for reading order prediction using a seq2seq model. In our experiments, it significantly improves how both open-source and commercial OCR engines order the text lines in their output.

Our paper "[LayoutReader: Pre-training of Text and Layout for Reading Order Detection](https://arxiv.org/pdf/2108.11591.pdf)" has been accepted by EMNLP 2021.

**ReadingBank** is a benchmark dataset for reading order detection built with weak supervision from Word documents. It contains 500K document images covering a wide range of document types, together with the corresponding reading order information. For more details, please refer to [ReadingBank](https://aka.ms/readingbank).
## Installation

~~~
conda create -n LayoutReader python=3.7
conda activate LayoutReader
conda install pytorch==1.7.1 -c pytorch
pip install nltk
python -c "import nltk; nltk.download('punkt')"
git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext
pip install transformers==2.10.0
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutreader
pip install -e .
~~~
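
Once the steps above finish, it can help to confirm that the packages are importable before launching a long training run. The following is a minimal sketch and not part of the LayoutReader repository: the helper name `check_packages` is hypothetical, and it assumes the installs above expose the top-level modules `torch`, `nltk`, `apex`, and `transformers`.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: True/False} depending on whether each
    top-level module can be located in the current environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names assumed from the install commands above.
    required = ["torch", "nltk", "apex", "transformers"]
    for name, found in check_packages(required).items():
        print(f"{name:<14}{'ok' if found else 'MISSING'}")
```

The check only locates the modules without importing them, so it does not verify versions or that the CUDA extensions of apex were built.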
## Run

1. Download the [pre-processed data](https://layoutlm.blob.core.windows.net/readingbank/dataset/ReadingBank.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D). For more details of the dataset, please refer to [ReadingBank](https://aka.ms/readingbank).
2. (Optional) Download our [pre-trained model](https://layoutlm.blob.core.windows.net/readingbank/model/layoutreader-base-readingbank.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) and evaluate it by following step 4.
3. Training

   ~~~
   # Expose four GPUs, one per process started by the launcher below
   export CUDA_VISIBLE_DEVICES=0,1,2,3
   export OMP_NUM_THREADS=4
   export MKL_NUM_THREADS=4
   python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
       --model_type layoutlm \
       --model_name_or_path layoutlm-base-uncased \
       --train_folder /path/to/ReadingBank/train \
       --output_dir /path/to/output/LayoutReader/layoutlm \
       --do_lower_case \
       --fp16 \
       --fp16_opt_level O2 \
       --max_source_seq_length 513 \
       --max_target_seq_length 511 \
       --per_gpu_train_batch_size 2 \
       --gradient_accumulation_steps 1 \
       --learning_rate 7e-5 \
       --num_warmup_steps 500 \
       --num_training_steps 75000 \
       --cache_dir /path/to/output/LayoutReader/cache \
       --label_smoothing 0.1 \
       --save_steps 5000 \
       --cached_train_features_file /path/to/ReadingBank/features_train.pt
   ~~~
4. Decoding

   ~~~
   # Decode on a single GPU, using the checkpoint written by step 3
   export CUDA_VISIBLE_DEVICES=0
   export OMP_NUM_THREADS=4
   export MKL_NUM_THREADS=4
   python decode_seq2seq.py --fp16 \
       --model_type layoutlm \
       --tokenizer_name bert-base-uncased \
       --input_folder /path/to/ReadingBank/test \
       --cached_feature_file /path/to/ReadingBank/features_test.pt \
       --output_file /path/to/output/LayoutReader/layoutlm/output.txt \
       --split test \
       --do_lower_case \
       --model_path /path/to/output/LayoutReader/layoutlm/ckpt-75000 \
       --cache_dir /path/to/output/LayoutReader/cache \
       --max_seq_length 1024 \
       --max_tgt_length 511 \
       --batch_size 32 \
       --beam_size 1 \
       --length_penalty 0 \
       --forbid_duplicate_ngrams \
       --mode s2s \
       --forbid_ignore_word "."
   ~~~
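
The flag values in steps 3 and 4 are tied together by some simple arithmetic, worked through below. This is a sketch under stated assumptions rather than a description of the scripts' internals: it assumes one process per GPU, one optimizer update per training step, a checkpoint written every `--save_steps` updates, and that the decoder's `--max_seq_length` budgets source plus target tokens. The variable names are illustrative.

```python
# Values copied from the training command in step 3.
num_gpus = 4                 # --nproc_per_node
per_gpu_batch = 2            # --per_gpu_train_batch_size
grad_accum = 1               # --gradient_accumulation_steps
num_training_steps = 75000   # --num_training_steps
save_steps = 5000            # --save_steps
max_source_len = 513         # --max_source_seq_length
max_target_len = 511         # --max_target_seq_length

# Value copied from the decoding command in step 4.
decode_max_seq_len = 1024    # --max_seq_length

# Sequences contributing to one parameter update (assumed data-parallel).
effective_batch = num_gpus * per_gpu_batch * grad_accum

# Total training sequences consumed over the whole run.
sequences_seen = effective_batch * num_training_steps

# Steps at which a checkpoint would be written under the assumption above.
checkpoint_steps = list(range(save_steps, num_training_steps + 1, save_steps))

# Check whether the training length budgets add up to the decoding length.
lengths_match = (max_source_len + max_target_len) == decode_max_seq_len

print(effective_batch, sequences_seen, len(checkpoint_steps), lengths_match)
```

Under these assumptions the run uses an effective batch of 8 sequences, consumes 600,000 sequences in total, and ends at step 75000, which matches the `ckpt-75000` path passed to `--model_path` in step 4. If you train on fewer GPUs, raising `--gradient_accumulation_steps` is one way to keep the effective batch unchanged.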
## Results

Our released [pre-trained model](https://layoutlm.blob.core.windows.net/readingbank/dataset/layoutreader-base-readingbank.zip) achieves an average page-level BLEU score of 98.2%. Detailed results are reported as follows:

* Evaluation results of LayoutReader on the reading order detection task, where the source side of the training/testing data is in left-to-right and top-to-bottom order

| Method                     | Encoder                | Avg. Page-level BLEU ↑ | ARD ↓ |
| -------------------------- | ---------------------- | ---------------------- | ----- |
| Heuristic Method           | -                      | 0.6972                 | 8.46  |
| LayoutReader (text only)   | BERT                   | 0.8510                 | 12.08 |
| LayoutReader (text only)   | UniLM                  | 0.8765                 | 10.65 |
| LayoutReader (layout only) | LayoutLM (layout only) | 0.9732                 | 2.31  |
| LayoutReader               | LayoutLM               | 0.9819                 | 1.75  |

* Input order study with left-to-right and top-to-bottom inputs in evaluation, where r is the proportion of shuffled samples in training.

| Method                          | Avg. Page-level BLEU ↑ | Avg. Page-level BLEU ↑ | Avg. Page-level BLEU ↑ | ARD ↓  | ARD ↓ | ARD ↓ |
|---------------------------------|------------------------|------------------------|------------------------|--------|-------|-------|
|                                 | r=100%                 | r=50%                  | r=0%                   | r=100% | r=50% | r=0%  |
| LayoutReader (text only, BERT)  | 0.3355                 | 0.8397                 | 0.8510                 | 77.97  | 15.62 | 12.08 |
| LayoutReader (text only, UniLM) | 0.3440                 | 0.8588                 | 0.8765                 | 78.67  | 13.65 | 10.65 |
| LayoutReader (layout only)      | 0.9701                 | 0.9729                 | 0.9732                 | 2.85   | 2.61  | 2.31  |
| LayoutReader                    | 0.9765                 | 0.9788                 | 0.9819                 | 2.50   | 2.24  | 1.75  |

* Input order study with token-shuffled inputs in evaluation, where r is the proportion of shuffled samples in training.

| Method                          | Avg. Page-level BLEU ↑ | Avg. Page-level BLEU ↑ | Avg. Page-level BLEU ↑ | ARD ↓  | ARD ↓ | ARD ↓  |
|---------------------------------|------------------------|------------------------|------------------------|--------|-------|--------|
|                                 | r=100%                 | r=50%                  | r=0%                   | r=100% | r=50% | r=0%   |
| LayoutReader (text only, BERT)  | 0.3085                 | 0.2730                 | 0.1711                 | 78.69  | 85.44 | 67.96  |
| LayoutReader (text only, UniLM) | 0.3119                 | 0.2855                 | 0.1728                 | 80.00  | 85.60 | 71.13  |
| LayoutReader (layout only)      | 0.9718                 | 0.9714                 | 0.1331                 | 2.72   | 2.82  | 105.40 |
| LayoutReader                    | 0.9772                 | 0.9770                 | 0.1783                 | 2.48   | 2.46  | 72.94  |
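
To make the effect of the input order study concrete, the snippet below recomputes how much average page-level BLEU changes between training settings when the evaluation inputs are token-shuffled. The numbers are copied from the last table; the dictionary layout and the function name `bleu_gain` are illustrative only.

```python
# Avg. page-level BLEU on token-shuffled evaluation inputs,
# keyed by method and then by r (percentage of shuffled training samples).
shuffled_eval_bleu = {
    "LayoutReader (text only, BERT)":  {100: 0.3085, 50: 0.2730, 0: 0.1711},
    "LayoutReader (text only, UniLM)": {100: 0.3119, 50: 0.2855, 0: 0.1728},
    "LayoutReader (layout only)":      {100: 0.9718, 50: 0.9714, 0: 0.1331},
    "LayoutReader":                    {100: 0.9772, 50: 0.9770, 0: 0.1783},
}


def bleu_gain(scores, r_from=0, r_to=50):
    """BLEU difference when moving from r_from% to r_to% shuffled training data."""
    return round(scores[r_to] - scores[r_from], 4)


for method, scores in shuffled_eval_bleu.items():
    print(f"{method:<34}{bleu_gain(scores):+.4f}")
```

Moving from r=0% to r=50%, the layout-aware variants gain roughly 0.8 BLEU on shuffled inputs, while the text-only variants gain roughly 0.1.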

## Citation

If you find LayoutReader helpful, please cite us:

```
@misc{wang2021layoutreader,
  title={LayoutReader: Pre-training of Text and Layout for Reading Order Detection},
  author={Zilong Wang and Yiheng Xu and Lei Cui and Jingbo Shang and Furu Wei},
  year={2021},
  eprint={2108.11591},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) and [s2s-ft](../s2s-ft) projects.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

## Contact

For help or issues using LayoutReader, please submit a GitHub issue.

For other communications related to LayoutLM, please contact Lei Cui (`lecu@microsoft.com`) or Furu Wei (`fuwei@microsoft.com`).