Instructions to use picket-cliff/deepl-project-model with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use picket-cliff/deepl-project-model with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="picket-cliff/deepl-project-model")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("picket-cliff/deepl-project-model")
model = AutoModelForSequenceClassification.from_pretrained("picket-cliff/deepl-project-model")
```
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -48,11 +48,15 @@ To evaluate the model, data was separated between training and test datasets (80
#### Preprocessing

Deep learning models cannot process raw text; they require numerical tensors. We utilized the Hugging Face DistilBertTokenizer.

1. Sub-word Tokenization: Instead of splitting on whitespace (which handles typos and rare words poorly), DistilBERT uses WordPiece tokenization. An out-of-vocabulary word is broken into known sub-words, so the model rarely has to fall back to the [UNK] ("unknown") token.

2. Special Tokens: The tokenizer automatically prepends the [CLS] (Classification) token to the sequence and appends the [SEP] (Separator) token to its end. The final hidden state corresponding to the [CLS] token is what the model uses for the binary classification decision.

3. Truncation and Padding: Batch processing requires every sequence in a batch to have the same length. Based on our EDA length distribution, we set max_length = 128.
   - Sentences longer than 128 tokens were truncated.
   - Sentences shorter than 128 tokens were padded with the [PAD] token (ID 0).

4. Attention Masks: To prevent the model from attending to meaningless padding tokens during self-attention, the tokenizer generates an attention_mask (an array of 1s for real tokens and 0s for padding).
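
The four preprocessing steps can be illustrated with a minimal, library-free sketch. It is not the DistilBertTokenizer implementation: the vocabulary `VOCAB`, its token IDs (other than [PAD] = 0), and the helper names `wordpiece` and `encode` are hypothetical, and words are split on whitespace only, whereas the real tokenizer also separates punctuation.

```python
from typing import Dict, List, Tuple

# Toy vocabulary (an assumption for illustration). The IDs are made up,
# except that [PAD] maps to 0 as stated in the text.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "movie": 5, "was": 6, "un": 7, "##watch": 8, "##able": 9,
}


def wordpiece(word: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match-first split of one word into known sub-words."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No sub-word fits: the whole word becomes [UNK].
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def encode(text: str, max_length: int = 128) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both of length max_length."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, VOCAB))
    # Truncate so that [SEP] still fits inside max_length.
    tokens = tokens[: max_length - 1] + ["[SEP]"]
    input_ids = [VOCAB[t] for t in tokens]
    attention_mask = [1] * len(input_ids)
    # Pad up to max_length; padded positions get mask value 0.
    n_pad = max_length - len(input_ids)
    input_ids += [VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


ids, mask = encode("The movie was unwatchable", max_length=10)
print(ids)
print(mask)
```

With the actual library, the same behaviour is normally requested in a single call such as `tokenizer(text, truncation=True, padding="max_length", max_length=128)`, which returns `input_ids` and `attention_mask` directly.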
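
To make the role of the attention mask concrete, one common formulation of masked scaled dot-product attention is given below. The symbols are assumptions for illustration: $m_j \in \{0, 1\}$ is the attention_mask value of token $j$, and $d_k$ is the key dimension. Implementations typically substitute a large negative finite number for $-\infty$.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } m_j = 1 \\
-\infty & \text{if } m_j = 0
\end{cases}
$$

Because $e^{-\infty} = 0$, every padded position $j$ receives zero attention weight after the softmax, so [PAD] tokens do not contribute to the representation of any real token.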