tafseer-nayeem committed (verified)
Commit 1640cd1 · 1 Parent(s): 05b076e

Update README.md

Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -15,12 +15,12 @@ library_name: transformers
 
 We continue to pre-train the [RoBERTa (base)](https://huggingface.co/FacebookAI/roberta-base) model on our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus) using a masked language modeling (MLM) objective. The KidLM (plus) model introduces a masking strategy called **Stratified Masking**, which varies the probability of masking based on word classes. This approach enhances the model's focus on tokens that are more informative and specifically tailored to children's language needs, aiming to steer language model predictions towards child-specific vocabulary derived from our high-quality [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus).
 
-To achieve this, Stratified Masking is introduced based on two key principles:
+To achieve this, Stratified Masking is introduced based on **two key principles**:
 
 1. All words in our corpus have a non-zero probability of being masked.
 2. Words more commonly found in a general corpus are masked with a lower probability.
 
-Based on these principles, each word in our corpus is assigned to one of the following three strata:
+Based on these principles, each word in our corpus is assigned to one of the following **three strata**:
 
 - **Stopwords**: These are the most frequent words in the language. We apply a **0.15** masking rate to these words.
 
@@ -80,7 +80,7 @@ print(predictions_kidLM_plus)
 
 ## Limitations and bias
 
-The training data used to build the KidLM model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions.
+The training data used to build the KidLM (plus) model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions.
 
 ```python
 from transformers import pipeline
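The Stratified Masking idea in the README text above can be sketched as a per-token masking probability chosen by stratum. This is a minimal illustration, not the authors' implementation: only the **0.15** stopword rate appears in the excerpt (the diff is truncated before the remaining strata), so the `common_rate` and `informative_rate` values, the toy stopword list, and the function names here are all illustrative assumptions.

```python
import random

# Toy stopword list; a real setup would use a proper stopword resource.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

def stratum_mask_prob(word, general_vocab, stopword_rate=0.15,
                      common_rate=0.20, informative_rate=0.30):
    """Masking probability for `word` by stratum.

    Only the 0.15 stopword rate comes from the README excerpt; the other
    two rates are placeholders. All rates are non-zero (principle 1), and
    words common in a general corpus get a lower rate than informative,
    child-specific tokens (principle 2).
    """
    w = word.lower()
    if w in STOPWORDS:
        return stopword_rate       # most frequent words in the language
    if w in general_vocab:
        return common_rate         # common in a general corpus: masked less
    return informative_rate        # informative / child-specific: masked most

def stratified_mask(tokens, general_vocab, mask_token="<mask>", seed=0):
    """Replace each token with `mask_token` according to its stratum's rate."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < stratum_mask_prob(t, general_vocab) else t
            for t in tokens]

tokens = "the puppy zoomies made everyone giggle".split()
print(stratified_mask(tokens, general_vocab={"made", "everyone"}))
```

In an actual MLM pre-training pipeline this per-word probability would replace the single uniform `mlm_probability` used by a standard data collator, applied at the word level before word pieces are expanded to subword tokens.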