Commit c8d6e52 (parent: 3c0d8f4): Include Lucy in acknowledgments, extend

README.md (changed):

# BERT-Wiki-Paragraphs

Authors: Satya Almasian\*, Dennis Aumiller\*, Lucienne-Sophie Marmé, Michael Gertz

Contact us at `<lastname>@informatik.uni-heidelberg.de`

Details for the training method can be found in our work [Structural Text Segmentation of Legal Documents](https://arxiv.org/abs/2012.03619).
The training procedure follows the same setup, but we substitute legal documents for Wikipedia in this model.

Training is performed in a weakly-supervised fashion to determine whether paragraphs topically belong together or not.
We utilize automatically generated samples from Wikipedia for training, where paragraphs from within the same section are assumed to be topically coherent.
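The pair-generation idea can be sketched as follows. This is a minimal illustration of the weak-supervision scheme described above, not the authors' actual data pipeline; the function name and the input format (a mapping from article titles to sections to paragraph lists) are assumptions for the example.

```python
import random

def make_pairs(articles, seed=0):
    """Generate weakly-supervised paragraph pairs.

    `articles` maps article titles to {section_name: [paragraphs]}.
    Positive pairs (label 1): adjacent paragraphs from the same section,
    assumed topically coherent. Negative pairs (label 0): paragraphs
    drawn from different sections of the same article.
    """
    rng = random.Random(seed)
    pairs = []
    for title, sections in articles.items():
        names = list(sections)
        for name in names:
            paras = sections[name]
            # positives: adjacent paragraphs within one section
            for a, b in zip(paras, paras[1:]):
                pairs.append((a, b, 1))
            # negatives: pair with a paragraph from another section
            others = [n for n in names if n != name]
            if others and paras:
                other = rng.choice(others)
                if sections[other]:
                    pairs.append((paras[0], rng.choice(sections[other]), 0))
    return pairs
```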
We use the same articles as [Koshorek et al., 2018](https://arxiv.org/abs/1803.09337),
albeit from a 2021 dump of Wikipedia, and split at paragraph boundaries instead of the sentence level.

## Training Setup
The model was trained for 3 epochs from `bert-base-uncased` on paragraph pairs (limited to 512 subwords with the `longest_first` truncation strategy).
We use a batch size of 24 with 2 steps of gradient accumulation (effective batch size of 48), and a learning rate of 1e-4, with gradient clipping at 5.
Training was performed on a single Titan RTX GPU over the duration of 3 weeks.
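For readers unfamiliar with the `longest_first` strategy mentioned above: when a pair of sequences exceeds the length budget, tokens are removed one at a time from whichever sequence is currently longer until the pair fits. The Hugging Face tokenizers implement this internally; the pure-Python sketch below (with a hypothetical function name, and ignoring the budget consumed by special tokens such as `[CLS]`/`[SEP]`) only illustrates the behavior.

```python
def truncate_longest_first(tokens_a, tokens_b, max_length):
    """Mimic the `longest_first` truncation strategy: repeatedly drop the
    last token from the (currently) longer sequence until the combined
    length of the pair fits within `max_length`."""
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > max_length:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return a, b
```

One consequence, visible in the sketch, is that a long paragraph paired with a short one bears most of the truncation, so the shorter paragraph is preserved intact whenever possible.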