# Multi-doc News Headline Generation Model: NHNet
This repository contains the TensorFlow 2.x implementation of NHNet [[1]](#1),
as well as instructions for producing the dataset we describe in the paper.
## Introduction
NHNet is a multi-doc news headline generation model. It extends a standard
Transformer-based encoder-decoder model to the multi-doc setting and relies on
an article-level attention layer to capture information common to most (if not
all) input news articles in a news cluster or story, and to provide robustness
against potential outliers in the input caused by imperfect clustering.
Our academic paper [[1]](#1), which describes NHNet in detail, can be found
here: https://arxiv.org/abs/2001.09386.
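For intuition, the sketch below shows one way such article-level weighting can
be realized on top of per-article encoder outputs: pool each article, score it,
softmax the scores across the cluster, and reweight the token-level outputs.
This is a minimal illustration under assumed shapes and a simple
mean-pooling/linear-scoring scheme, not the actual NHNet implementation.

```python
import tensorflow as tf

# Illustrative sketch of article-level attention (NOT the repository's
# implementation). Shapes and layer choices are assumptions.
batch, num_articles, seq_len, hidden = 2, 5, 200, 768

# Per-article encoder outputs: [batch, num_articles, seq_len, hidden].
encoder_outputs = tf.random.normal([batch, num_articles, seq_len, hidden])

# Pool each article (here: mean over tokens) to get one vector per article.
article_vecs = tf.reduce_mean(encoder_outputs, axis=2)  # [batch, A, H]

# Score each article and normalize across the cluster; mis-clustered
# outlier articles should receive low weight.
scores = tf.keras.layers.Dense(1)(article_vecs)   # [batch, A, 1]
weights = tf.nn.softmax(scores, axis=1)           # [batch, A, 1]

# Reweight token-level outputs and flatten into one decoder memory.
weighted = encoder_outputs * weights[..., tf.newaxis]
cluster_memory = tf.reshape(weighted, [batch, num_articles * seq_len, hidden])
print(cluster_memory.shape)  # (2, 1000, 768)
```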
## Dataset
**Raw Data:** One can [download](https://github.com/google-research-datasets/NewSHead)
our multi-doc headline dataset, which contains 369,940 news stories and
932,571 unique URLs. We split these stories by timestamp into train (359,940
stories), validation (5,000 stories), and test (5,000 stories) sets.
For more information, please check out:
https://github.com/google-research-datasets/NewSHead
### Crawling
Unfortunately, we are not able to release the pre-processed dataset exactly as
used in the paper. Users need to crawl the URLs themselves; the recommended
pre-processing is to use an open-source library to download and parse the news
content, including the title and leading paragraphs. To ease this process, we
provide a config for [news-please](https://github.com/fhamborg/news-please)
that will crawl and extract news articles on a local machine.
First, install the `news-please` CLI (requires Python 3.x):
```shell
$ pip3 install news-please
```
Next, run the crawler with our provided [config and URL list](https://github.com/google-research-datasets/NewSHead/releases):
```shell
# Set to the path of the downloaded data folder.
$ DATA_FOLDER=/path/to/downloaded_dataset
# Use the CLI to crawl. We assume the news_please subfolder contains the
# decompressed config.cfg and sitelist.hjson.
$ news-please -c $DATA_FOLDER/news_please
```
By default, crawled articles are stored under `/tmp/nhnet/`. To terminate the
process, press `CTRL+C`.
Crawling may take a few days (48 hours in our test), depending on the network
environment and the number of threads set in the config. Because the crawling
tool does not stop automatically, it is not straightforward to track progress.
We suggest terminating the job once no new articles have been crawled within a
short time period (e.g., 10 minutes), which you can check by periodically
running
```shell
$ find /tmp/nhnet -type f | wc -l
```
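If you prefer not to poll by hand, a small watcher can run the same check on a
timer and tell you when the file count has gone flat. The helper below is a
hypothetical convenience script, not part of this repository; the path and
ten-minute window mirror the suggestion above.

```python
import time
from pathlib import Path

# Hypothetical helper (not part of this repository): report when the number
# of crawled files under /tmp/nhnet stops growing for one polling window.
def wait_until_crawl_stalls(root="/tmp/nhnet", poll_seconds=600):
    previous = -1
    while True:
        count = sum(1 for p in Path(root).rglob("*") if p.is_file())
        print(f"{count} files crawled so far")
        if count == previous:
            print("No new articles in the last window; "
                  "safe to CTRL+C the crawler.")
            return count
        previous = count
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_until_crawl_stalls()
```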
Please note that, as time goes by, some URLs are expected to no longer be
available on the web.
### Data Processing
Given the crawled articles under `/tmp/nhnet/`, we transform these textual
articles into a set of `TFRecord` files containing serialized
tensorflow.Example protocol buffers, with feature keys following the BERT
[[2]](#2) convention but extended to multiple text segments. We will later
use these processed TFRecords for training and evaluation.
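For intuition, the sketch below builds one such multi-segment example by hand.
It is only illustrative: the exact feature keys and padding are determined by
`raw_data_preprocess.py`, and names like `input_ids_0` and `target_ids` here
are assumptions.

```python
import tensorflow as tf

# Illustrative sketch of the target format, assuming BERT-style integer
# features replicated once per input article; the real keys are defined by
# raw_data_preprocess.py and may differ.
def make_example(article_token_ids, title_token_ids):
    features = {}
    for i, ids in enumerate(article_token_ids):  # one segment per article
        features[f"input_ids_{i}"] = tf.train.Feature(
            int64_list=tf.train.Int64List(value=ids))
    features["target_ids"] = tf.train.Feature(
        int64_list=tf.train.Int64List(value=title_token_ids))
    return tf.train.Example(features=tf.train.Features(feature=features))

# Two toy articles and one toy headline, already tokenized to word-piece ids.
example = make_example([[101, 2023, 102], [101, 2178, 102]], [101, 2773, 102])
print(example)
```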
To do this, please first download a [BERT pretrained checkpoint](https://github.com/tensorflow/models/tree/master/official/nlp/bert#access-to-pretrained-checkpoints)
(`BERT-Base, Uncased` is preferred for efficiency) and decompress the `tar.gz`
file. We need its vocabulary file here and will later use the checkpoint to
initialize NHNet.
Next, run the following data preprocessing script, which may take a few hours
to read files and tokenize article content:
```shell
# Recall that we use DATA_FOLDER=/path/to/downloaded_dataset.
$ python3 raw_data_preprocess.py \
    -crawled_articles=/tmp/nhnet \
    -vocab=/path/to/bert_checkpoint/vocab.txt \
    -do_lower_case=True \
    -len_title=15 \
    -len_passage=200 \
    -max_num_articles=5 \
    -data_folder=$DATA_FOLDER
```
This Python script will export the processed train/valid/eval files under
`$DATA_FOLDER/processed/`.
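A quick way to sanity-check the export is to open one shard, list its feature
keys, and count records. The snippet below assumes only the
`$DATA_FOLDER/processed/train.tfrecord*` layout stated above.

```python
import glob
import tensorflow as tf

# Sanity-check the exported shards (paths follow the layout above).
DATA_FOLDER = "/path/to/downloaded_dataset"
files = glob.glob(f"{DATA_FOLDER}/processed/train.tfrecord*")
dataset = tf.data.TFRecordDataset(files)

# Print the feature keys of the first record to confirm the schema.
for raw in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    print(sorted(example.features.feature.keys()))

# Count records (may take a while on the full training set).
print(sum(1 for _ in dataset))
```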
## Training
Please first install TensorFlow 2 and the TensorFlow Model Garden following the
[requirements section](https://github.com/tensorflow/models/tree/master/official#requirements).
### CPU/GPU
```shell
$ python3 trainer.py \
    --mode=train_and_eval \
    --vocab=/path/to/bert_checkpoint/vocab.txt \
    --init_checkpoint=/path/to/bert_checkpoint/bert_model.ckpt \
    --params_override='init_from_bert2bert=false' \
    --train_file_pattern=$DATA_FOLDER/processed/train.tfrecord* \
    --model_dir=/path/to/output/model \
    --len_title=15 \
    --len_passage=200 \
    --max_num_articles=5 \
    --model_type=nhnet \
    --train_batch_size=16 \
    --train_steps=10000 \
    --steps_per_loop=1 \
    --checkpoint_interval=100
```
### TPU
```shell
$ python3 trainer.py \
    --mode=train_and_eval \
    --vocab=/path/to/bert_checkpoint/vocab.txt \
    --init_checkpoint=/path/to/bert_checkpoint/bert_model.ckpt \
    --params_override='init_from_bert2bert=false' \
    --train_file_pattern=$DATA_FOLDER/processed/train.tfrecord* \
    --model_dir=/path/to/output/model \
    --len_title=15 \
    --len_passage=200 \
    --max_num_articles=5 \
    --model_type=nhnet \
    --train_batch_size=1024 \
    --train_steps=10000 \
    --steps_per_loop=1000 \
    --checkpoint_interval=1000 \
    --distribution_strategy=tpu \
    --tpu=grpc://${TPU_IP_ADDRESS}:8470
```
In the paper, we trained for more than 10k steps with a batch size of 1024 on
TPU v3-64.
Note that `trainer.py` also supports a `train`-only mode and a continuous
`eval` mode. For large-scale TPU training, we recommend having one process run
the `train` mode and another process run continuous `eval` (which can run on
GPUs). This is the setting we commonly use for large-scale experiments, because
`eval` is then non-blocking to the expensive training load.
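A hypothetical launcher for that two-process setup might look like the sketch
below. It is not part of this repository, the flag lists are abbreviated
(reuse the full flag sets from the commands above), and the `--mode=train` /
`--mode=eval` values follow the mode names mentioned above.

```python
import subprocess

# Hypothetical sketch: run training and continuous evaluation as separate
# processes sharing one model_dir, so evaluation never blocks training.
MODEL_DIR = "/path/to/output/model"
train = subprocess.Popen(
    ["python3", "trainer.py", "--mode=train", f"--model_dir={MODEL_DIR}",
     "--distribution_strategy=tpu"])  # plus the remaining TPU flags above
evaluate = subprocess.Popen(
    ["python3", "trainer.py", "--mode=eval", f"--model_dir={MODEL_DIR}"])
train.wait()
evaluate.terminate()
```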
### Metrics
**Note: the metrics reported by `evaluation.py` are approximated at the
word-piece level rather than computed on the real string tokens. Some metrics,
such as BLEU scores, can be off.**
We will release a colab to evaluate results at the string level soon.
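Until then, if you want a rough string-level comparison yourself, standard
WordPiece detokenization (merging `##` continuation pieces back onto the
previous token) recovers approximate surface strings from word pieces. The
helper below is a hypothetical sketch, not part of `evaluation.py`.

```python
# Hypothetical post-processing sketch: recover approximate strings from
# word pieces before computing string-level metrics. Standard WordPiece
# detokenization joins "##" continuation pieces to the previous token.
def detokenize_word_pieces(pieces):
    tokens = []
    for piece in pieces:
        if piece.startswith("##") and tokens:
            tokens[-1] += piece[2:]
        else:
            tokens.append(piece)
    return " ".join(tokens)

print(detokenize_word_pieces(["head", "##line", "generation"]))
# -> "headline generation"
```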
## References
<a id="1">[1]</a> Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You Wu, Cong
Yu, Daniel Finnie, Hongkun Yu, Jiaqi Zhai and Nicholas Zukoski. "Generating
Representative Headlines for News Stories". World Wide Web Conf. (WWW 2020).
https://arxiv.org/abs/2001.09386
<a id="2">[2]</a> Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina
Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding". https://arxiv.org/abs/1810.04805