Upload folder using huggingface_hub
- README.md +5 -5
- config.json +1 -1
- tokenizer_config.json +1 -1
README.md
CHANGED
@@ -52,9 +52,9 @@ You can load and use this model using the `transformers` library from Hugging Fa
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
-import
+import sentencepiece as spm # Ensure sentencepiece is installed
 
-model_name = "Duino/Darija-LM"
+model_name = "Duino/Darija-LM" # or path to your saved model locally
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
 
@@ -77,12 +77,12 @@ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
 print(generated_text)
 ```
 
-**Note on Tokenizer:** This model uses a SentencePiece tokenizer.
+**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it's correctly configured in the repository.
 
 ## Training Details
 
 The model was trained using the following steps:
-1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using
+1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using `datasets` library and preprocessed.
 2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
 3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
 4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
@@ -92,7 +92,7 @@ The model was trained using the following steps:
 
 ## Evaluation
 
-**[TODO: Include evaluation metrics if you have them.
+**[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
 - [Metrics and results on a validation set or benchmark.]
 
 ## Citation
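Step 4 of the README's training details mentions memory mapping without showing it. The sketch below illustrates the general idea using only the standard library: token IDs are stored as a flat binary file of 16-bit integers, and a single context window is read back through `mmap` without loading the whole file. The file layout, the helper names `write_tokens` and `read_window`, and the window length of 256 (borrowed from `block_size` in config.json) are assumptions for illustration, not the author's training code. `numpy.memmap` is a common alternative for the same purpose.

```python
import mmap
import os
import tempfile
from array import array

BLOCK_SIZE = 256  # assumption: mirrors "block_size" in config.json


def write_tokens(path, token_ids):
    """Store token IDs (each below 65536) as unsigned 16-bit integers."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)


def read_window(path, start, length=BLOCK_SIZE):
    """Return `length` token IDs starting at index `start`, read via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The views are released before the mapping itself is closed.
            with memoryview(mm) as raw, raw.cast("H") as tokens:
                return list(tokens[start:start + length])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tokens.bin")
        write_tokens(path, range(10_000))  # stand-in for real token IDs
        window = read_window(path, start=5_000)
        print(len(window), window[:3])
```

Only the pages backing the requested window are touched, which is why this pattern scales to token files larger than available RAM.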
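The Evaluation TODO in the README suggests perplexity on a held-out validation set. As a sketch of what that computation involves: perplexity is the exponential of the mean per-token negative log-likelihood. The function name `perplexity` and its input format (a list of natural-log probabilities that the model assigned to each correct next token) are assumptions for illustration; extracting those log-probabilities from the model is not shown.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the given tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that spreads probability uniformly over 4 choices at every
    # step should score a perplexity very close to 4.
    uniform = [math.log(1 / 4)] * 10
    print(perplexity(uniform))
```

Lower is better; a perplexity of 1 would mean the model assigned probability 1 to every correct token.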
config.json
CHANGED
@@ -8,7 +8,7 @@
 "n_layer": 6,
 "block_size": 256,
 "dropout": 0.2,
-"tokenizer_class": "
+"tokenizer_class": "SentencePieceTokenizerFast",
 "tokenizer_file": "spm_model.model",
 "_name_or_path": "Duino/Darija-LM",
 "model_type": "gpt2"
tokenizer_config.json
CHANGED
@@ -1,4 +1,4 @@
 {
-"tokenizer_class": "
+"tokenizer_class": "SentencePieceTokenizerFast",
 "model_file": "spm_model.model"
 }
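Both config files name `SentencePieceTokenizerFast` as the tokenizer class. In case `AutoTokenizer` cannot resolve that name in a given `transformers` version, the SentencePiece model can also be loaded directly with the `sentencepiece` package. A minimal sketch, assuming the repository has been downloaded to a local directory that contains `tokenizer_config.json` and the `spm_model.model` file it points to; the helper names `resolve_spm_path` and `load_spm` are hypothetical.

```python
import json
from pathlib import Path


def resolve_spm_path(repo_dir):
    """Read tokenizer_config.json and return the path of the file it names."""
    repo_dir = Path(repo_dir)
    config = json.loads((repo_dir / "tokenizer_config.json").read_text())
    return repo_dir / config["model_file"]


def load_spm(repo_dir):
    """Load the SentencePiece model directly, bypassing AutoTokenizer."""
    import sentencepiece as spm  # requires: pip install sentencepiece

    return spm.SentencePieceProcessor(model_file=str(resolve_spm_path(repo_dir)))


# Usage, assuming a local copy of the repository in ./Darija-LM:
#   sp = load_spm("./Darija-LM")
#   ids = sp.encode("some text", out_type=int)
#   text = sp.decode(ids)
```

The `sentencepiece` import sits inside `load_spm` so that the path lookup can be used and checked without the package installed.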