---
datasets:
- CCRss/small-chatgpt-paraphrases-kz
language:
- kk
library_name: transformers
tags:
- text-generation-inference
license: mit
---
## Model Overview
The **qqp_kz** model is a paraphrasing tool tailored for the Kazakh language. It is built upon the **humarin/chatgpt_paraphraser_on_T5_base** model, inheriting its robust architecture and adapting it to the nuances of Kazakh.

### Key Features:
- Language: Specifically designed for paraphrasing in Kazakh.
- Base Model: Derived from **chatgpt_paraphraser_on_T5_base**, a proven model in paraphrasing tasks.
- Tokenizer: Utilizes **CCRss/tokenizer_t5_kz** for optimal Kazakh language processing.

### Data Preprocessing
The dataset used for training the qqp_kz model undergoes rigorous preprocessing to ensure compatibility and optimal performance:
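Before preprocessing, the data needs to be loaded. The snippet below is a minimal sketch, assuming the `CCRss/small-chatgpt-paraphrases-kz` dataset listed in this card's metadata, with `src`/`trg` text columns and `train`/`valid` splits as referenced by the code that follows; adjust these names if the actual dataset structure differs.

```python
from datasets import load_dataset

# Load the paraphrase dataset referenced in this card's metadata.
# Assumption: it exposes "src"/"trg" text columns and "train"/"valid" splits,
# matching what the preprocessing and training code below expects.
dataset = load_dataset("CCRss/small-chatgpt-paraphrases-kz")
print(dataset)  # Inspect the available splits and columns before preprocessing.
```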
```python
# Importing necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Initializing the tokenizer for the specific model. This tokenizer is used to convert
# text input into a format that is understandable by the model.
tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")

# Define a function for preprocessing the data. This function takes an example
# (which includes source and target texts) and tokenizes both texts using the tokenizer.
# The tokenized output is padded and truncated to a fixed length for consistent model input.
def preprocess_data(example):
    # Extracting the source and target texts from the example
    source = example["src"]
    target = example["trg"]

    # Tokenizing the source text with padding and truncation to ensure a fixed length
    source_inputs = tokenizer(source, padding="max_length", truncation=True, max_length=128)

    # Tokenizing the target text with padding and truncation to ensure a fixed length
    target_inputs = tokenizer(target, padding="max_length", truncation=True, max_length=128)
    # Returning the tokenized source as the model inputs and the tokenized target ids as the labels
    return {**source_inputs, "labels": target_inputs["input_ids"]}

# Applying the preprocessing function to the dataset, effectively transforming all text data
# into a tokenized format suitable for the Seq2Seq model.
encoded_dataset = dataset.map(preprocess_data)
# Setting the format of the dataset to PyTorch tensors for compatibility with the training framework.
encoded_dataset.set_format("torch")
```
### Model Training

The model is trained with the following configuration:

```python
# Importing necessary classes for training from the transformers library
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Name of the pretrained model to be used for Seq2Seq learning
name_of_model = "humarin/chatgpt_paraphraser_on_T5_base"
# Loading the model from the pretrained weights
model = AutoModelForSeq2SeqLM.from_pretrained(name_of_model)

# Setting up training arguments. This includes batch size, learning rate, number of epochs,
# directories for saving results and logs, and evaluation strategy.
training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=21,
    gradient_accumulation_steps=3,
    learning_rate=5e-5,
    save_steps=2000,
    num_train_epochs=3,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=2000,
    eval_steps=2000,
    evaluation_strategy="steps"
)

# Initializing the trainer with the model, training arguments, and the datasets for training and evaluation.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['valid']
)

# Starting the training process of the model using the specified datasets and training arguments.
trainer.train()
```

### Usage
The **qqp_kz** model is specifically designed for paraphrasing in the Kazakh language. It is highly suitable for a variety of NLP tasks such as content creation, enhancing translations, and linguistic research.

To utilize the model:

- Install the transformers library.
- Load the model using the Hugging Face API.
- Input your Kazakh text for paraphrasing (a minimal inference sketch follows this list).

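The snippet below is a minimal inference sketch rather than the card's official example. It assumes the model is published under an identifier like `CCRss/qqp_kz` (based on this card's naming); replace it with the actual repository id if it differs.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the Kazakh tokenizer and the paraphrasing model.
# "CCRss/qqp_kz" is an assumed repository id; adjust to the actual one.
tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")
model = AutoModelForSeq2SeqLM.from_pretrained("CCRss/qqp_kz")

# A Kazakh sentence to paraphrase.
text = "Алматы - Қазақстанның ең ірі қаласы."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model.generate(
    **inputs,
    max_length=128,
    num_beams=5,             # beam search tends to give cleaner paraphrases
    num_return_sequences=3,  # return several candidate paraphrases
    no_repeat_ngram_size=2,
)

for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)
```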
### Example Deployment
For a practical demonstration of the model in action, please refer to our [Google Colab notebook](https://colab.research.google.com/drive/1ieNhrPnh-MEAlmMgGFVffB1LLXtaXsuf?usp=sharing). This notebook provides a comprehensive example of how to run inference with the qqp_kz model.

### Contributions and Feedback
We welcome contributions to the qqp_kz model. If you have suggestions, improvements, or encounter any issues, please feel free to open an issue in the repository.