cfinley/punct_restore_fr

Token Classification

Generated from Trainer

cfinley committed on Jun 27, 2021

Commit 7abcf18 · 1 Parent(s): 686253f

Update README.md

Files changed (1)

README.md +14 -4

README.md CHANGED

@@ -24,7 +24,7 @@ should probably proofread and complete it, then remove this comment. -->
 # punct_restore_fr
-This model is a fine-tuned version of [camembert-base](https://huggingface.co/camembert-base) on an unkown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.0301
 - Precision: 0.9601
@@ -34,15 +34,25 @@ It achieves the following results on the evaluation set:
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
 ## Training procedure

 # punct_restore_fr
+This model is a fine-tuned version of [camembert-base](https://huggingface.co/camembert-base) on a raw OpenSubtitles dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.0301
 - Precision: 0.9601
 ## Model description
+Classifies each token as either the beginning of a sentence (B-SENT) or not (O).
 ## Intended uses & limitations
+This model aims to help restore punctuation in French YouTube auto-generated subtitles.
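As an illustration of how B-SENT/O tags could feed punctuation restoration, here is a minimal sketch. It is not the author's code. The function name `restore_sentences` is hypothetical, and closing every sentence with a period is an assumption: the tags only mark where a sentence starts, not which punctuation mark ends the previous one.

```python
def restore_sentences(words, tags):
    """Rebuild capitalized, period-terminated sentences from word/tag pairs.

    Assumption: each B-SENT tag opens a new sentence, and every sentence
    is closed with a period because the tags carry no punctuation type.
    """
    sentences, current = [], []
    for word, tag in zip(words, tags):
        # A B-SENT tag closes the sentence collected so far (if any).
        if tag == "B-SENT" and current:
            sentences.append(current)
            current = []
        current.append(word)
    if current:
        sentences.append(current)

    restored = []
    for sentence in sentences:
        text = " ".join(sentence)
        # Upper-case only the first character; leave the rest untouched.
        restored.append(text[:1].upper() + text[1:] + ".")
    return " ".join(restored)


words = "bonjour tout le monde on commence tout de suite".split()
tags = ["B-SENT", "O", "O", "O", "B-SENT", "O", "O", "O", "O"]
print(restore_sentences(words, tags))
```

In practice the tags would come from the model's token-classification output, aligned back from subword tokens to words; that alignment step is left out of this sketch.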
 ## Training and evaluation data
+1 million Open Subtitles (French) sentences. 80%/10%/10% training/validation/test split.
+The sentences:
+- were lower-cased
+- had end punctuation (.?!) removed
+- were of length between 7 and 70 words
+- had beginning word of sentence tagged with B-SENT.
+    - All other words marked with O.
+Token/tag pairs were batched together in groups of 64. This exposes the model to B-SENT and O tags in a variety of positions and keeps a training example from being just one sentence, in which case the first word, and only the first word, of every sequence would be labeled B-SENT.
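The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the author's script. The names `tag_sentence` and `group_pairs` are hypothetical. It assumes the 7-to-70 word filter is applied after cleaning, and that "groups of 64" means 64 consecutive token/tag pairs drawn from the concatenated sentences.

```python
MIN_WORDS, MAX_WORDS = 7, 70   # sentence length limits from the README
GROUP_SIZE = 64                # token/tag pairs per example (assumed reading)


def tag_sentence(sentence):
    """Lower-case, drop end punctuation, and tag the first word B-SENT.

    Returns a list of (word, tag) pairs, or None if the cleaned sentence
    falls outside the 7-70 word range.
    """
    words = sentence.lower().strip().rstrip(".?!").split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return None
    return [(word, "B-SENT" if i == 0 else "O") for i, word in enumerate(words)]


def group_pairs(sentences, group_size=GROUP_SIZE):
    """Concatenate tagged sentences and cut the stream into fixed-size groups."""
    stream = []
    for sentence in sentences:
        # Sentences rejected by the length filter contribute nothing.
        stream.extend(tag_sentence(sentence) or [])
    return [stream[i:i + group_size] for i in range(0, len(stream), group_size)]


# Small group size here only to make the effect visible on three sentences.
sample = ["Je ne sais pas ce que tu veux dire."] * 3
for group in group_pairs(sample, group_size=10):
    print([tag for _, tag in group])
```

Because group boundaries do not line up with sentence boundaries, B-SENT shows up at different offsets within each group, which is the variety of positions the README describes.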
 ## Training procedure