Industry-Specific-Tokens_Tokenizer_ReadMe.md
This custom tokenizer is a necessary and practical component of the supply chain forecasting model, designed for the Enhanced Business Model for Collaborative Predictive Supply Chain. It prioritizes industry-specific tokens from a `vocab.json` file and falls back to Byte-Pair Encoding (BPE) for out-of-vocabulary (OOV) words. Dedicated special tokens cover the data and feature types expected in supply chain data, and the tokenizer meets the model's specific requirements end to end: the custom industry-specific vocabulary, BPE training, and preparation of data for a Transformer.

Key Features:
1. **Custom Vocabulary Prioritization:** The tokenizer initializes its vocabulary from a `vocab.json` file. This file contains predefined tokens for common entities in the supply chain domain (e.g., specific SKUs, store identifiers, manufacturer plant codes). These tokens are given precedence over tokens learned through BPE.
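A minimal sketch of what such a `vocab.json` might contain (the token names and IDs below are invented examples, not the project's actual vocabulary):

```python
import json

# Invented example entries: token -> ID. Special tokens get the low,
# stable IDs; industry-specific tokens follow.
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "sku_10023": 5,       # a specific SKU
    "store_ny_014": 6,    # a store identifier
    "plant_tx_02": 7,     # a manufacturer plant code
}

with open("vocab.json", "w") as f:
    json.dump(vocab, f, indent=2)
```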
2. **Byte-Pair Encoding (BPE) for OOV Handling:** To handle variations in product names, new SKUs, or other unseen words, the tokenizer incorporates BPE. The `train_bpe` method allows the tokenizer to learn subword units from a provided text corpus, enabling it to represent words not present in the initial `vocab.json`.
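Conceptually, BPE training repeatedly merges the most frequent adjacent symbol pair. A toy, from-scratch sketch of that loop (not the project's `train_bpe`, which delegates to the `tokenizers` library; the corpus is invented):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus; return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, with each word split into characters to start.
words = {tuple("widget"): 10, tuple("widgets"): 6, tuple("gadget"): 4}
merges = []
for _ in range(5):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
# After five merges, "widget" has become a single learned subword.
```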
3. **Data Preprocessing for Transformers:** The `prepare_for_model` method is crucial for integrating the tokenizer with the model's data pipeline. It takes a Pandas DataFrame (containing features like timestamp, SKU, store ID, quantity, price, promotions) and transforms each row into a sequence of token IDs and an attention mask, ready for input to a Transformer model. This includes:
   * Constructing a formatted input string from DataFrame columns.
   * Tokenizing the string using the custom vocabulary and BPE.
   * Padding (or truncating) sequences to a specified `max_length`.
   * Creating an attention mask to indicate valid tokens vs. padding.
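Setting aside the DataFrame plumbing, the padding and attention-mask step can be sketched as follows (the pad ID, token IDs, and lengths are illustrative):

```python
PAD_ID = 0  # assumed ID of the [PAD] token

def pad_and_mask(token_ids, max_length):
    """Truncate or right-pad a sequence and build the matching attention mask."""
    ids = token_ids[:max_length]       # truncate if too long
    mask = [1] * len(ids)              # 1 = real token
    padding = max_length - len(ids)
    ids = ids + [PAD_ID] * padding     # pad with [PAD]'s ID
    mask = mask + [0] * padding        # 0 = padding, ignored by attention
    return ids, mask

ids, mask = pad_and_mask([5, 6, 17, 42], max_length=6)
# ids  == [5, 6, 17, 42, 0, 0]
# mask == [1, 1, 1, 1, 0, 0]
```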
4. **Standard Tokenization Operations:** The tokenizer provides the standard methods expected of a tokenizer:
   * `encode`: Converts text to a list of tokens.
   * `encode_as_ids`: Converts text to a list of token IDs.
   * `decode`: Converts token IDs back to text.
   * `token_to_id`: Converts a token string to its ID.
   * `id_to_token`: Converts a token ID to its string representation.
   * `get_vocab_size`: Returns the size of the vocabulary.
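A toy round trip through these operations, using a hand-built lookup table in place of the real vocabulary (tokens and IDs are invented):

```python
# Stand-in lookup tables for the tokenizer's vocabulary.
vocab = {"[UNK]": 1, "sku_10023": 5, "store_ny_014": 6}
inv_vocab = {i: t for t, i in vocab.items()}

def encode_as_ids(tokens):
    """Map tokens to IDs, falling back to [UNK] for unknown tokens."""
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

def decode(ids):
    """Map IDs back to a space-joined string of tokens."""
    return " ".join(inv_vocab[i] for i in ids)

ids = encode_as_ids(["sku_10023", "store_ny_014", "mystery_item"])
# ids == [5, 6, 1]; decode(ids) == "sku_10023 store_ny_014 [UNK]"
```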
5. **Saving and Loading:** The `save` and `from_pretrained` methods allow for easy persistence and reuse of the trained tokenizer, including both the tokenizer configuration (in Hugging Face's `tokenizer.json` format) and a copy of the `vocab.json`.
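The save/load contract can be sketched with plain files (the directory layout mirrors the description above; the config content here is only a placeholder):

```python
import json
import os
import shutil
import tempfile

def save(directory, config, vocab_path):
    """Write the tokenizer config and a copy of vocab.json to `directory`."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "tokenizer.json"), "w") as f:
        json.dump(config, f)
    shutil.copy(vocab_path, os.path.join(directory, "vocab.json"))

def from_pretrained(directory):
    """Reload the configuration saved by `save`."""
    with open(os.path.join(directory, "tokenizer.json")) as f:
        return json.load(f)

# Round trip through a temporary directory.
workdir = tempfile.mkdtemp()
vocab_file = os.path.join(workdir, "vocab.json")
with open(vocab_file, "w") as f:
    json.dump({"[UNK]": 1}, f)

save(os.path.join(workdir, "tok"), {"model": "BPE"}, vocab_file)
config = from_pretrained(os.path.join(workdir, "tok"))
```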
6. **Integration with `tokenizers` Library:** The tokenizer is built on the `tokenizers` library from Hugging Face, ensuring efficiency and compatibility with other Hugging Face tools.
7. **Normalization and Pre-Tokenization:** Includes lowercasing, Unicode normalization, and pre-tokenization (splitting on whitespace and individual digits).
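A rough stand-alone sketch of that pipeline (the NFKC normalization form is an assumption; the real tokenizer performs these steps via the `tokenizers` library):

```python
import re
import unicodedata

def normalize(text):
    """Lowercase the text and apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text.lower())

def pre_tokenize(text):
    """Split on whitespace, then split each digit into its own piece."""
    pieces = []
    for chunk in text.split():
        pieces.extend(p for p in re.split(r"(\d)", chunk) if p)
    return pieces

pieces = pre_tokenize(normalize("SKU 10023 Restock"))
# pieces == ['sku', '1', '0', '0', '2', '3', 'restock']
```

Splitting individual digits keeps numeric fields like quantities and SKU numbers out of the BPE merges, so numbers of any length can be represented.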
8. **Special Tokens:** Handles the special tokens (`[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`) for Transformer models.
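For a single sequence, the special-token wrapping amounts to the following (the IDs are illustrative, matching nothing in particular):

```python
CLS_ID, SEP_ID = 2, 3  # assumed IDs for [CLS] and [SEP]

def add_special_tokens(ids):
    """Wrap a token-ID sequence as [CLS] ... [SEP] for a Transformer."""
    return [CLS_ID] + ids + [SEP_ID]

wrapped = add_special_tokens([5, 6, 17])
# wrapped == [2, 5, 6, 17, 3]
```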
The `Tokenizer2` script also includes a comprehensive example usage section demonstrating how to create, train, use, save, and load the tokenizer. This tokenizer is a critical component for bridging the gap between raw supply chain data and a Transformer-based forecasting model.
Key aspects and explanations:

* **Custom Vocabulary Handling:** The `vocab.json` gives priority to the industry-specific tokens.
* **Complete:** The code covers creating a dummy `vocab.json`, training the BPE model, encoding, decoding, saving, loading, and preparing a Pandas DataFrame for input.
* **BPE Training:** The `train_bpe` method is implemented using the `tokenizers` library's `BpeTrainer`. This lets the tokenizer learn merges for words *not* in the initial vocabulary, which is crucial for handling variations in product names, new SKUs, etc.
* **Normalization and Pre-tokenization:** Standard normalization (lowercasing, Unicode normalization) and pre-tokenization (splitting on whitespace and individual digits).
* **Special Tokens:** Correctly handles the special tokens (`[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`) for Transformer models.
* **Encoding and Decoding:** Methods for encoding text into tokens and IDs, and for decoding IDs back into text.
* **`token_to_id` and `id_to_token`:** Methods for converting between tokens and IDs.
* **`get_vocab_size`:** Returns the vocabulary size.
* **Saving and Loading:** The `save` and `from_pretrained` methods persist the tokenizer, including both the `tokenizer.json` (Hugging Face's configuration) and a copy of the `vocab.json`.
* **`prepare_for_model` Method:** The *most important* method. It takes a Pandas DataFrame as input and:
    * Constructs an input string for each row, combining the relevant features (timestamp, SKU, store ID, etc.). *This is where you define the input format for your model.*
    * Tokenizes the string.
    * Handles padding (or truncation) to the `max_length`.
    * Creates the attention mask.
    * Returns the `input_ids` (list of token ID sequences) and `attention_masks`, ready for use in a Transformer model.
* **Clear Example Usage:** The `if __name__ == "__main__":` block provides a comprehensive example showing how to use all the key methods.
* **Error Handling:** Checks for the existence of the `vocab.json` file.
* **Type Hinting:** Uses type hints for better code clarity and maintainability.
* **Pandas Integration:** The tokenizer works directly with Pandas DataFrames, which are commonly used for this type of data.
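Building the per-row input strings from a DataFrame might look like this (the column names and string format are assumptions, not the model's actual input format):

```python
import pandas as pd

# Invented sample rows with the kinds of features described above.
df = pd.DataFrame({
    "timestamp": ["2024-01-05", "2024-01-06"],
    "sku": ["sku_10023", "sku_10041"],
    "store_id": ["store_ny_014", "store_ny_014"],
    "quantity": [120, 95],
})

# One formatted input string per row, ready for tokenization.
texts = df.apply(
    lambda r: f"date {r['timestamp']} sku {r['sku']} "
              f"store {r['store_id']} qty {r['quantity']}",
    axis=1,
).tolist()
# texts[0] == "date 2024-01-05 sku sku_10023 store store_ny_014 qty 120"
```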