---
language: en
tags:
- text segmentation
- document chunking
license: apache-2.0
datasets:
- wikipedia
pipeline_tag: text-classification
base_model: "distilbert/distilbert-base-uncased"
---

# DistilBERT Cross Segment Document Chunking

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) that classifies whether two subsequent sentences come from the same Wikipedia article section. Its intended use is **text segmentation/document chunking**.
It is based on the paper *Text Segmentation by Cross Segment Attention* by Michal Lukasik, Boris Dadachev, Gonçalo Simões, and Kishore Papineni.

## How to use it

One way to use this model is via the Hugging Face `transformers` `TextClassificationPipeline` class.

```python
from transformers import (
    AutoModelForSequenceClassification,
    DistilBertTokenizer,
    TextClassificationPipeline,
)

model_name = "BlueOrangeDigital/distilbert-cross-segment-document-chunking"

id2label = {0: "SAME", 1: "DIFFERENT"}
label2id = {"SAME": 0, "DIFFERENT": 1}

tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# Each input is a "left context [SEP] right context" pair of adjacent sentences.
pairs = [
    "Left context. [SEP] Right context.",
    "he also earned five mvp stars with the martian men's tenis team in 2149. [SEP] mart jhones spent the 2166 and 2167 seasons with the all stars intergalactic in the interstelar soccer league ( isl ).",
]

# top_k=None returns the score for every label (return_all_scores is deprecated).
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=None)
pipe(pairs)
# [[{'label': 'SAME', 'score': 0.9845659136772156},
#   {'label': 'DIFFERENT', 'score': 0.015434039756655693}],
#  [{'label': 'SAME', 'score': 0.44031277298927307},
#   {'label': 'DIFFERENT', 'score': 0.5596872568130493}]]
```
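The pipeline scores adjacent sentence pairs; turning those pairwise decisions into chunks is a single scan over the predictions. A minimal sketch, where the sentences and labels are illustrative stand-ins for real pipeline output and `chunk_sentences` is a hypothetical helper, not part of this model:

```python
def chunk_sentences(sentences, labels):
    """Group sentences into chunks, given one SAME/DIFFERENT prediction
    per adjacent sentence pair: labels[i] refers to the pair
    (sentences[i], sentences[i + 1])."""
    chunks = [[sentences[0]]]
    for sentence, label in zip(sentences[1:], labels):
        if label == "DIFFERENT":
            chunks.append([sentence])    # predicted section boundary: new chunk
        else:
            chunks[-1].append(sentence)  # same section: extend current chunk
    return chunks

# Hypothetical argmax labels for four sentences (three adjacent pairs).
sentences = ["A.", "B.", "C.", "D."]
labels = ["SAME", "DIFFERENT", "SAME"]
chunk_sentences(sentences, labels)
# → [['A.', 'B.'], ['C.', 'D.']]
```

A document with `n` sentences yields `n - 1` pairs, so one pipeline call over all pairs is enough to segment the whole document.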
55
+
56
+ ## Training Data
57
+ Sentences pairs from 40,000 (train) + 4,000 (validation) Wikipedia articles.
58
+ **Label 1:** Two subsequent sentences that are not from the same article section;
59
+ **Label 0:** Every other pair of subsequent sentences.
60
+
61
+ Label 0 pairs were undersampled, resulting in a total of 408,753 and 45,417 training and validation pairs, respectively.
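The card does not state the exact undersampling scheme; a minimal sketch of one plausible approach (the `undersample` helper and its `ratio` parameter are assumptions, not the model's documented procedure) keeps every label-1 pair and randomly subsamples label-0 pairs:

```python
import random

def undersample(pairs, labels, ratio=1.0, seed=0):
    # Keep all label-1 (DIFFERENT) pairs; randomly keep at most
    # `ratio` times as many label-0 (SAME) pairs.
    rng = random.Random(seed)
    pos = [p for p, lab in zip(pairs, labels) if lab == 1]
    neg = [p for p, lab in zip(pairs, labels) if lab == 0]
    neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    return pos + neg

balanced = undersample(["a", "b", "c", "d"], [1, 0, 0, 0])
len(balanced)
# → 2
```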
62
+
63
+ The input of the model are of the form
64
+
65
+ ```
66
+ [CLS] Right context [SEP] Left context [SEP]
67
+ ```
68
+
69
+ Given DistilBERT 512 token limit, both right and left context are limited to 255 token length. When exceeding this limit, the sentence was truncated (either the beggining or the end of the sentence, for right and left context, respectively).
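The truncation rule above can be sketched in a few lines; `truncate_pair` is an illustrative helper operating on token-id lists, not part of the released code:

```python
def truncate_pair(left_ids, right_ids, max_side=255):
    # Keep at most `max_side` tokens per context, preferring tokens
    # nearest the candidate boundary: drop the beginning of the left
    # context and the end of the right context.
    return left_ids[-max_side:], right_ids[:max_side]

left, right = truncate_pair(list(range(300)), list(range(300)))
len(left), len(right)
# → (255, 255)
```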
70
+
71
+ ## Trainig Procedure
72
+
73
+ The model was trained for 2 epochs with a learning rate of 1e-5 and cross-entropy loss on a P100 GPU for 8 hours.
74
+
75
+
76
+ ## Validation Metrics
77
+
78
+ | Loss | Accuracy | Recall | Precision | F1 |
79
+ |:----:|:----:|:----:|:-----:|:----:|
80
+ | 0.398 | 0.815 | 0.815 | 0.817 | 0.815 |