Upload folder using huggingface_hub

Files changed (3)

README.md +5 -5
config.json +1 -1
tokenizer_config.json +1 -1

README.md CHANGED

@@ -52,9 +52,9 @@ You can load and use this model using the `transformers` library from Hugging Fa
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-model_name = "Duino/Darija-LM"  # or path to your saved model locally
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
@@ -77,12 +77,12 @@ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
 print(generated_text)
 ```
-**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it's correctly configured in the repository.
 ## Training Details
 The model was trained using the following steps:
-1. **Data Streaming and Preprocessing:**  Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed.
 2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
 3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
 4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
@@ -92,7 +92,7 @@ The model was trained using the following steps:
 ## Evaluation
-**[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
 - [Metrics and results on a validation set or benchmark.]
 ## Citation

 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
+import sentencepiece as spm  # Ensure sentencepiece is installed
+model_name = "Duino/Darija-LM" # or path to your saved model locally
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
 print(generated_text)
 ```
+**Note on Tokenizer:** This model uses a SentencePiece tokenizer.  When loading with `transformers`, it should automatically handle the SentencePiece model if it's correctly configured in the repository.
 ## Training Details
 The model was trained using the following steps:
+1. **Data Streaming and Preprocessing:**  Wikipedia datasets for Arabic and Darija were streamed using `datasets` library and preprocessed.
 2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
 3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
 4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
 ## Evaluation
+**[TODO: Include evaluation metrics if you have them.  It's highly recommended to evaluate your model and add metrics here.  For example, you could calculate perplexity on a held-out validation set.]**
 - [Metrics and results on a validation set or benchmark.]
 ## Citation
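
The Evaluation TODO in the README suggests reporting perplexity on a held-out validation set. A minimal sketch of that calculation follows, assuming per-token natural-log probabilities have already been obtained from the model. The function name `perplexity` and the sample values are hypothetical and not part of this commit.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_log_probs: natural-log probabilities the model assigned to each
    held-out token (hypothetical input; obtaining them is not shown here).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative values only: a model that assigns probability 0.25 to every
# token is as uncertain as a uniform choice among four tokens.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower values mean the model assigns higher probability to the held-out text. Perplexity depends on the tokenizer, so figures are only comparable between models that share the same SentencePiece vocabulary.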

config.json CHANGED

@@ -8,7 +8,7 @@
     "n_layer": 6,
     "block_size": 256,
     "dropout": 0.2,
-    "tokenizer_class": "PreTrainedTokenizerFast",
     "tokenizer_file": "spm_model.model",
     "_name_or_path": "Duino/Darija-LM",
     "model_type": "gpt2"

     "n_layer": 6,
     "block_size": 256,
     "dropout": 0.2,
+    "tokenizer_class": "SentencePieceTokenizerFast",
     "tokenizer_file": "spm_model.model",
     "_name_or_path": "Duino/Darija-LM",
     "model_type": "gpt2"

tokenizer_config.json CHANGED

@@ -1,4 +1,4 @@
 {
-    "tokenizer_class": "PreTrainedTokenizerFast",
     "model_file": "spm_model.model"
 }

 {
+    "tokenizer_class": "SentencePieceTokenizerFast",
     "model_file": "spm_model.model"
 }
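
Both JSON files in this commit carry the same `tokenizer_class` value and point at the same SentencePiece file under different keys (`tokenizer_file` in `config.json`, `model_file` in `tokenizer_config.json`). The sketch below shows one way to check that the two stay in sync. The helper name `check_tokenizer_settings` is hypothetical, and only the keys visible in the diff are used. It does not check whether the installed `transformers` release exports a class named `SentencePieceTokenizerFast`, which is worth verifying separately.

```python
import json


def check_tokenizer_settings(config, tokenizer_config):
    """Return a list of mismatches between the two config dictionaries."""
    problems = []

    # The tokenizer class should be named identically in both files.
    cls_a = config.get("tokenizer_class")
    cls_b = tokenizer_config.get("tokenizer_class")
    if cls_a != cls_b:
        problems.append(f"tokenizer_class differs: {cls_a!r} vs {cls_b!r}")

    # The SentencePiece file is stored under a different key in each file.
    file_a = config.get("tokenizer_file")
    file_b = tokenizer_config.get("model_file")
    if file_a != file_b:
        problems.append(f"model file differs: {file_a!r} vs {file_b!r}")

    return problems


# Values copied from the new side of the diff; other keys are omitted.
config = json.loads(
    '{"tokenizer_class": "SentencePieceTokenizerFast",'
    ' "tokenizer_file": "spm_model.model"}'
)
tokenizer_config = json.loads(
    '{"tokenizer_class": "SentencePieceTokenizerFast",'
    ' "model_file": "spm_model.model"}'
)
print(check_tokenizer_settings(config, tokenizer_config))
```

An empty list means the two files agree. In practice the dictionaries would be read from the repository's `config.json` and `tokenizer_config.json` with `json.load`.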