---
language: en
tags:
- text segmentation
- document chunking
license: apache-2.0
datasets:
- wikipedia
pipeline_tag: text-classification
base_model: "distilbert/distilbert-base-uncased"
---

# DistilBERT Cross Segment Document Chunking

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) that classifies whether two subsequent sentences come from the same Wikipedia article section. Its intended use is **text segmentation/document chunking**.
It is based on the paper *Text Segmentation by Cross Segment Attention* by Michal Lukasik, Boris Dadachev, Gonçalo Simões, and Kishore Papineni.

## How to use it

One way to use this model is via the Hugging Face `transformers` `TextClassificationPipeline` class.

```python
from transformers import (
    AutoModelForSequenceClassification,
    DistilBertTokenizer,
    TextClassificationPipeline,
)

model_name = "BlueOrangeDigital/distilbert-cross-segment-document-chunking"

id2label = {0: "SAME", 1: "DIFFERENT"}
label2id = {"SAME": 0, "DIFFERENT": 1}

tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# Each input is a "left context [SEP] right context" pair of adjacent sentences.
pairs = [
    "Left context. [SEP] Right context.",
    "he also earned five mvp stars with the martian men's tenis team in 2149. [SEP] mart jhones spent the 2166 and 2167 seasons with the all stars intergalactic in the interstelar soccer league ( isl ).",
]

# top_k=None returns the score for every label (return_all_scores is deprecated).
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=None)
pipe(pairs)
# [[{'label': 'SAME', 'score': 0.9845659136772156},
#   {'label': 'DIFFERENT', 'score': 0.015434039756655693}],
#  [{'label': 'SAME', 'score': 0.44031277298927307},
#   {'label': 'DIFFERENT', 'score': 0.5596872568130493}]]
```
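The pipeline scores adjacent sentence pairs; turning those pairwise decisions into chunks is a single scan over the predictions. A minimal sketch, where the sentences and labels are illustrative stand-ins for real pipeline output and `chunk_sentences` is a hypothetical helper, not part of this model:

```python
def chunk_sentences(sentences, labels):
    """Group sentences into chunks, given one SAME/DIFFERENT prediction
    per adjacent sentence pair: labels[i] refers to the pair
    (sentences[i], sentences[i + 1])."""
    chunks = [[sentences[0]]]
    for sentence, label in zip(sentences[1:], labels):
        if label == "DIFFERENT":
            chunks.append([sentence])    # predicted section boundary: new chunk
        else:
            chunks[-1].append(sentence)  # same section: extend current chunk
    return chunks

# Hypothetical argmax labels for four sentences (three adjacent pairs).
sentences = ["A.", "B.", "C.", "D."]
labels = ["SAME", "DIFFERENT", "SAME"]
chunk_sentences(sentences, labels)
# → [['A.', 'B.'], ['C.', 'D.']]
```

A document with `n` sentences yields `n - 1` pairs, so one pipeline call over all pairs is enough to segment the whole document.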
55
+
56
+ ## Training Data
57
+ Sentences pairs from 40,000 (train) + 4,000 (validation) Wikipedia articles.
58
+ **Label 1:** Two subsequent sentences that are not from the same article section;
59
+ **Label 0:** Every other pair of subsequent sentences.
60
+
61
+ Label 0 pairs were undersampled, resulting in a total of 408,753 and 45,417 training and validation pairs, respectively.
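The card does not state the exact undersampling scheme; a minimal sketch of one plausible approach (the `undersample` helper and its `ratio` parameter are assumptions, not the model's documented procedure) keeps every label-1 pair and randomly subsamples label-0 pairs:

```python
import random

def undersample(pairs, labels, ratio=1.0, seed=0):
    # Keep all label-1 (DIFFERENT) pairs; randomly keep at most
    # `ratio` times as many label-0 (SAME) pairs.
    rng = random.Random(seed)
    pos = [p for p, lab in zip(pairs, labels) if lab == 1]
    neg = [p for p, lab in zip(pairs, labels) if lab == 0]
    neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    return pos + neg

balanced = undersample(["a", "b", "c", "d"], [1, 0, 0, 0])
len(balanced)
# → 2
```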
62
+
63
+ The input of the model are of the form
64
+
65
+ ```
66
+ [CLS] Right context [SEP] Left context [SEP]
67
+ ```
68
+
69
+ Given DistilBERT 512 token limit, both right and left context are limited to 255 token length. When exceeding this limit, the sentence was truncated (either the beggining or the end of the sentence, for right and left context, respectively).
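The truncation rule above can be sketched in a few lines; `truncate_pair` is an illustrative helper operating on token-id lists, not part of the released code:

```python
def truncate_pair(left_ids, right_ids, max_side=255):
    # Keep at most `max_side` tokens per context, preferring tokens
    # nearest the candidate boundary: drop the beginning of the left
    # context and the end of the right context.
    return left_ids[-max_side:], right_ids[:max_side]

left, right = truncate_pair(list(range(300)), list(range(300)))
len(left), len(right)
# → (255, 255)
```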
70
+
71
+ ## Trainig Procedure
72
+
73
+ The model was trained for 2 epochs with a learning rate of 1e-5 and cross-entropy loss on a P100 GPU for 8 hours.
74
+
75
+
76
+ ## Validation Metrics
77
+
78
+ | Loss | Accuracy | Recall | Precision | F1 |
79
+ |:----:|:----:|:----:|:-----:|:----:|
80
+ | 0.398 | 0.815 | 0.815 | 0.817 | 0.815 |