tafseer-nayeem committed (verified)
Commit 1640cd1 · 1 Parent(s): 05b076e

Update README.md

Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -15,12 +15,12 @@ library_name: transformers
 
 We continue to pre-train the [RoBERTa (base)](https://huggingface.co/FacebookAI/roberta-base) model on our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus) using a masked language modeling (MLM) objective. The KidLM (plus) model introduces a masking strategy called **Stratified Masking**, which varies the probability of masking based on word classes. This approach enhances the model's focus on tokens that are more informative and specifically tailored to children's language needs, aiming to steer language model predictions towards child-specific vocabulary derived from our high-quality [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus).
 
-To achieve this, Stratified Masking is introduced based on two key principles:
+To achieve this, Stratified Masking is introduced based on **two key principles**:
 
 1. All words in our corpus have a non-zero probability of being masked.
 2. Words more commonly found in a general corpus are masked with a lower probability.
 
-Based on these principles, each word in our corpus is assigned to one of the following three strata:
+Based on these principles, each word in our corpus is assigned to one of the following **three strata**:
 
 - **Stopwords**: These are the most frequent words in the language. We apply a **0.15** masking rate to these words.
 
@@ -80,7 +80,7 @@ print(predictions_kidLM_plus)
 
 ## Limitations and bias
 
-The training data used to build the KidLM model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions.
+The training data used to build the KidLM (plus) model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions.
 
 ```python
 from transformers import pipeline
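The Stratified Masking idea in the README text above can be sketched as a per-token masking probability chosen by stratum. This is a minimal illustration, not the authors' implementation: only the **0.15** stopword rate appears in the excerpt (the diff is truncated before the remaining strata), so the `common_rate` and `informative_rate` values, the toy stopword list, and the function names here are all illustrative assumptions.

```python
import random

# Toy stopword list; a real setup would use a proper stopword resource.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

def stratum_mask_prob(word, general_vocab, stopword_rate=0.15,
                      common_rate=0.20, informative_rate=0.30):
    """Masking probability for `word` by stratum.

    Only the 0.15 stopword rate comes from the README excerpt; the other
    two rates are placeholders. All rates are non-zero (principle 1), and
    words common in a general corpus get a lower rate than informative,
    child-specific tokens (principle 2).
    """
    w = word.lower()
    if w in STOPWORDS:
        return stopword_rate       # most frequent words in the language
    if w in general_vocab:
        return common_rate         # common in a general corpus: masked less
    return informative_rate        # informative / child-specific: masked most

def stratified_mask(tokens, general_vocab, mask_token="<mask>", seed=0):
    """Replace each token with `mask_token` according to its stratum's rate."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < stratum_mask_prob(t, general_vocab) else t
            for t in tokens]

tokens = "the puppy zoomies made everyone giggle".split()
print(stratified_mask(tokens, general_vocab={"made", "everyone"}))
```

In an actual MLM pre-training pipeline this per-word probability would replace the single uniform `mlm_probability` used by a standard data collator, applied at the word level before word pieces are expanded to subword tokens.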