	---
	license: mit
	tags:
	- generated_from_trainer
	metrics:
	- precision
	- recall
	- f1
	- accuracy
	model_index:
	- name: punct_restore_fr
	  results:
	  - task:
	      name: Token Classification
	      type: token-classification
	    metric:
	      name: Accuracy
	      type: accuracy
	      value: 0.991500810518732
	base_model: camembert-base
	---


	# punct_restore_fr

	This model is a fine-tuned version of [camembert-base](https://huggingface.co/camembert-base) on a raw French OpenSubtitles dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.0301
	- Precision: 0.9601
	- Recall: 0.9527
	- F1: 0.9564
	- Accuracy: 0.9915
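
As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the function name `f1_score` is only illustrative and is not taken from the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in this model card.
precision, recall = 0.9601, 0.9527
f1 = f1_score(precision, recall)
print(round(f1, 4))  # 0.9564, matching the reported F1
```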

	## Model description

	The model classifies each token as either the beginning of a French sentence (B-SENT) or anything else (O).

	## Intended uses & limitations

	This model is intended to support punctuation restoration on auto-generated French YouTube subtitles. Once sentence boundaries are restored, corpus-level measures such as words per sentence or grammatical structures per sentence can be computed.
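
A minimal sketch of how B-SENT/O predictions might be turned back into sentences and a words-per-sentence figure. The tags below are written by hand for illustration and are not real model output; the helper names are hypothetical. Because the model only marks sentence starts, the full stop appended here is a placeholder and not a predicted punctuation mark.

```python
from typing import List


def split_sentences(words: List[str], tags: List[str]) -> List[List[str]]:
    """Start a new sentence at every word tagged B-SENT."""
    sentences: List[List[str]] = []
    for word, tag in zip(words, tags):
        if tag == "B-SENT" or not sentences:
            sentences.append([])
        sentences[-1].append(word)
    return sentences


def restore_text(sentences: List[List[str]]) -> str:
    """Capitalize each sentence and close it with a placeholder full stop."""
    restored = []
    for sentence in sentences:
        text = " ".join(sentence)
        restored.append(text[0].upper() + text[1:] + ".")
    return " ".join(restored)


# Hand-written example tags (an assumption, not model predictions).
words = "bonjour comment ça va je vais bien merci".split()
tags = ["B-SENT", "B-SENT", "O", "O", "B-SENT", "O", "O", "O"]

sentences = split_sentences(words, tags)
words_per_sentence = sum(len(s) for s in sentences) / len(sentences)

print(restore_text(sentences))
print(words_per_sentence)
```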

	## Training and evaluation data

	The data consists of 1 million French OpenSubtitles sentences, split 80%/10%/10% into training, validation, and test sets.

	The sentences:

	- were lower-cased
	- had end punctuation (.?!) removed
	- were of length between 7 and 70 words
	- had the first word of each sentence tagged with B-SENT
	- had all other words tagged with O
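
The preprocessing steps listed above could be sketched roughly as follows. This is not the original preprocessing script: the function name `tag_sentence`, whitespace tokenization, and inclusive length bounds are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

END_PUNCT = ".?!"


def tag_sentence(
    sentence: str, min_words: int = 7, max_words: int = 70
) -> Optional[List[Tuple[str, str]]]:
    """Lower-case, strip end punctuation, filter by length, and tag words.

    Returns None when the sentence falls outside the length bounds
    (assumed inclusive here).
    """
    # Also strip trailing spaces, since French puts a space before ? and !
    words = sentence.lower().rstrip(END_PUNCT + " ").split()
    if not (min_words <= len(words) <= max_words):
        return None
    return [(word, "B-SENT" if i == 0 else "O") for i, word in enumerate(words)]


pairs = tag_sentence("Je ne sais pas ce que tu veux dire.")
print(pairs[:3])
print(tag_sentence("Bonjour !"))  # too short, so it is dropped
```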

	Token/tag pairs were batched together in groups of 64. This places B-SENT and O tags at a variety of positions and keeps training examples from consisting of a single sentence. Without it, the first word, and only the first word, of every sequence would be labeled B-SENT.
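
One possible reading of the grouping step is that the tagged sentences are concatenated into a single stream and cut into examples of 64 token/tag pairs. The sketch below follows that reading, which is an assumption; the card does not spell out the exact procedure, and `group_pairs` is a hypothetical helper.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def group_pairs(
    tagged_sentences: Iterable[List[Pair]], size: int = 64
) -> Iterator[List[Pair]]:
    """Concatenate tagged sentences and yield chunks of `size` pairs."""
    buffer: List[Pair] = []
    for sentence in tagged_sentences:
        buffer.extend(sentence)
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[size:]
    if buffer:  # final, possibly shorter, chunk
        yield buffer


# Toy data: 16 identical 10-word sentences (placeholder words).
sentence = [("w0", "B-SENT")] + [(f"w{i}", "O") for i in range(1, 10)]
chunks = list(group_pairs([sentence] * 16))

for chunk in chunks:
    # Positions of B-SENT differ from chunk to chunk.
    print([i for i, (_, tag) in enumerate(chunk) if tag == "B-SENT"])
```

In this toy run, sentence starts fall at different offsets in each chunk, which is the variety of positions the paragraph above describes.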

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 16
	- eval_batch_size: 8
	- seed: 1
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 3
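
To illustrate what `lr_scheduler_type: linear` implies for the learning rate, here is a small pure-Python sketch of a linear decay from 5e-05 to zero. It assumes no warmup, since none is listed above, and the total step count is a made-up value used only for the example.

```python
def linear_lr(
    step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0
) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real value depends on dataset size and epochs
for step in (0, 500, 1000):
    print(step, linear_lr(step, total_steps))
```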

	### Training results



	### Framework versions

	- Transformers 4.8.1
	- PyTorch 1.9.0+cu102
	- Datasets 1.8.0
	- Tokenizers 0.10.3