Tamazight Language Model with SentencePiece BPE Tokenizer
This is a custom Transformer-based language model trained from scratch on Tamazight (Berber) text with a SentencePiece BPE tokenizer.
Model Details
Architecture
- Model Type: Transformer-based Language Model
- Parameters: 98,299,392
- Hidden Size: 568
- Attention Heads: 12
- Layers: 12
- Maximum Sequence Length: 256
- Vocabulary Size: 8,000
Tokenizer
- Type: SentencePiece BPE (Byte-Pair Encoding)
- Vocabulary Size: 8,000
- Special Tokens: <s>, </s>, <unk>, <pad>
- Scripts Supported: Tifinagh, Latin, Arabic
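To illustrate how a BPE tokenizer like this one builds its subword vocabulary, here is a minimal, self-contained sketch of the classic merge loop. It is a toy illustration only, not the actual SentencePiece training code; the corpus and merge count are made up:

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen pair with its merged symbol
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# Toy corpus of space-separated characters
# (hypothetical Tamazight-like words in Latin script)
vocab = {"a s a l a m": 5, "a s a m": 2, "s a l": 3}
merges = []
for _ in range(3):  # learn 3 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
```

Repeating this loop until the vocabulary reaches 8,000 symbols yields the merge table a BPE tokenizer applies at encoding time.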
Usage
Loading the Model
```python
import sentencepiece as spm
import torch

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# Load model weights; for a .bin file this typically returns a state_dict
# that must be loaded into the model class before use
model = torch.load('pytorch_model.bin', map_location='cpu')
```
Tokenizing Text
```python
# Encode
text = "Asalam"
pieces = sp.encode_as_pieces(text)  # subword pieces
ids = sp.encode_as_ids(text)        # integer token ids

# Decode
decoded = sp.decode_ids(ids)
```
Files
- pytorch_model.bin: Model weights
- config.json: Model configuration
- tokenizer.model: SentencePiece BPE tokenizer
- tokenizer_config.json: Tokenizer configuration
- special_tokens_map.json: Special tokens mapping
Languages
Supports Tamazight text in multiple scripts:
- Tifinagh: Traditional Amazigh script
- Latin: Latin transliteration
- Arabic: Arabic script representation
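Because input text may arrive in any of these three scripts, it can be useful to detect which script a string uses before tokenizing. Below is a minimal sketch using Python's standard unicodedata module; the helper name detect_script is illustrative and not part of this repository:

```python
import unicodedata

def detect_script(text):
    # Tally letters per script by inspecting each character's Unicode name
    counts = {"Tifinagh": 0, "Latin": 0, "Arabic": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("TIFINAGH"):
            counts["Tifinagh"] += 1
        elif name.startswith("ARABIC"):
            counts["Arabic"] += 1
        elif "LATIN" in name:
            counts["Latin"] += 1
    # Return the dominant script, or None if no letters were seen
    return max(counts, key=counts.get) if any(counts.values()) else None
```

For example, detect_script("ⴰⵣⵓⵍ") returns "Tifinagh", while detect_script("Azul") returns "Latin".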
Created on: 2026-02-07