picket-cliff committed on
Commit 58604c5 · verified · 1 Parent(s): d7c3e45

Update README.md

Files changed (1)
  1. README.md +4 -0
README.md CHANGED
@@ -48,11 +48,15 @@ To evaluate the model, data was separated between training and test datasets (80
48
  #### Preprocessing
49
 
50
  Deep learning models cannot process raw text; they require numerical tensors. We used the Hugging Face DistilBertTokenizer to convert each input text into tensors of token IDs.
51
+
52
  1. Sub-word Tokenization: Instead of splitting on whitespace (which handles typos and rare words poorly), DistilBERT uses WordPiece tokenization. An out-of-vocabulary word is broken into known sub-word pieces, so the model rarely encounters the [UNK] (unknown) token.
53
+
54
  2. Special Tokens: The tokenizer automatically prepends the [CLS] (classification) token to the sequence and appends the [SEP] (separator) token. The final hidden state at the [CLS] position is what the model uses for the binary classification decision.
55
+
56
  3. Truncation and Padding: All sequences in a batch must have the same length so they can be stacked into a single tensor. Based on the length distribution from our EDA, we set max_length = 128.
57
     - Sequences longer than 128 tokens were truncated.
58
     - Sequences shorter than 128 tokens were padded with the [PAD] token (ID 0).
59
+
60
  4. Attention Masks: To keep self-attention from attending to meaningless padding tokens, the tokenizer generates an attention_mask: an array with 1 for every real token (including [CLS] and [SEP]) and 0 for every padding position.
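
The four steps above can be sketched in plain Python. This is a toy illustration, not the real DistilBertTokenizer: `TOY_VOCAB`, `wordpiece`, and `encode` are hypothetical names, the vocabulary is made up (only [PAD] = 0 mirrors the README), and the real tokenizer additionally splits punctuation and handles casing and accents.

```python
# Toy illustration only: TOY_VOCAB, wordpiece() and encode() are hypothetical,
# not the real DistilBERT vocabulary or the Hugging Face API.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "movie": 5, "was": 6, "un": 7, "##watch": 8, "##able": 9,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    tokens = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    tokens = tokens[: max_length - 2]        # truncate, leaving room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]  # special tokens at both ends
    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)    # 1 = real token (special tokens included)
    n_pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * n_pad    # pad with [PAD] (ID 0)
    attention_mask += [0] * n_pad            # 0 = padding position
    return input_ids, attention_mask


ids, mask = encode("The movie was unwatchable", TOY_VOCAB, max_length=10)
print(ids)   # [2, 4, 5, 6, 7, 8, 9, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Here "unwatchable" is not in the toy vocabulary, so it is split into `un`, `##watch`, `##able` rather than mapped to [UNK]. With the actual library, the same steps are typically handled by a single tokenizer call with `padding="max_length"`, `truncation=True`, and `max_length=128`; the exact call used in this project is not shown in the README, so treat that as an assumption.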
61
 
62