---
language:
- en
tags:
- sentence-similarity
- text-classification
datasets:
- dennlinger/wiki-paragraphs
metrics:
- f1
license: mit
---

# BERT-Wiki-Paragraphs

Authors: Satya Almasian\*, Dennis Aumiller\*, Lucienne-Sophie Marmé, Michael Gertz
Contact us at `<lastname>@informatik.uni-heidelberg.de`
Details on the training method can be found in our work [Structural Text Segmentation of Legal Documents](https://arxiv.org/abs/2012.03619).
The training procedure follows the same setup, but we substitute legal documents with Wikipedia for this model.
Find the associated training data here: [wiki-paragraphs](https://huggingface.co/datasets/dennlinger/wiki-paragraphs)

Training is performed in a weakly-supervised fashion to determine whether paragraphs topically belong together or not.
We utilize automatically generated samples from Wikipedia for training, where paragraphs from within the same section are assumed to be topically coherent.
We use the same articles as [Koshorek et al., 2018](https://arxiv.org/abs/1803.09337),
albeit from a 2021 dump of Wikipedia, and split at paragraph boundaries instead of the sentence level.

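For illustration, a minimal sketch of this sampling scheme could look as follows (this is not the exact generation script; the section/paragraph structure and the negative-sampling heuristic are assumptions):

```python
import random

def make_pairs(article, num_negatives=1):
    """Build weakly-supervised paragraph pairs from one article.

    `article` is assumed to be a list of sections, each section a list of
    paragraph strings. Paragraphs from the same section form positive pairs
    (label 1); paragraphs from different sections form negative pairs (label 0).
    """
    pairs = []
    for i, section in enumerate(article):
        # Positive pairs: consecutive paragraphs within the same section.
        for p1, p2 in zip(section, section[1:]):
            pairs.append((f"{p1} [SEP] {p2}", 1))
        # Negative pairs: a paragraph from this section vs. one from another section.
        other_sections = [s for j, s in enumerate(article) if j != i and s]
        for _ in range(num_negatives):
            if section and other_sections:
                p1 = random.choice(section)
                p2 = random.choice(random.choice(other_sections))
                pairs.append((f"{p1} [SEP] {p2}", 0))
    return pairs
```
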
## Usage
Preferred usage is through `transformers.pipeline`:

```python
from transformers import pipeline
pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")

pipe("{First paragraph} [SEP] {Second paragraph}")
```
A predicted "1" means that paragraphs belong to the same topic, a "0" indicates a disconnect.

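Continuing from the snippet above, the pipeline returns a label string and a confidence score. Assuming the default `LABEL_0`/`LABEL_1` naming (the exact label names depend on the model's `id2label` config), the prediction can be mapped back to a binary decision like this:

```python
result = pipe("{First paragraph} [SEP] {Second paragraph}")
# Illustrative output: [{'label': 'LABEL_1', 'score': 0.98}]
same_topic = result[0]["label"].endswith("1")
print("Same topic" if same_topic else "Topic change")
```
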
## Training Setup
The model was trained for 3 epochs from `bert-base-uncased` on paragraph pairs (limited to 512 subword tokens with the `longest_first` truncation strategy).
We use a batch size of 24 with 2 iterations of gradient accumulation (an effective batch size of 48), and a learning rate of 1e-4, with gradient clipping at 5.
Training was performed on a single Titan RTX GPU over the duration of 3 weeks.

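As a rough sketch, these hyperparameters correspond to a `transformers` Trainer configuration along the following lines (illustrative only; dataset loading, column names, and preprocessing are assumptions, not the original training script):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Paragraph pairs, truncated to 512 subword tokens with the
    # `longest_first` strategy described above.
    return tokenizer(batch["first"], batch["second"],
                     truncation="longest_first", max_length=512)

training_args = TrainingArguments(
    output_dir="bert-wiki-paragraphs",
    num_train_epochs=3,                  # 3 epochs
    per_device_train_batch_size=24,      # batch size of 24
    gradient_accumulation_steps=2,       # effective batch size of 48
    learning_rate=1e-4,
    max_grad_norm=5.0,                   # gradient clipping at 5
)

# trainer = Trainer(model=model, args=training_args, train_dataset=...)
# trainer.train()
```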