Tamazight Language Model with SentencePiece BPE Tokenizer
This is a custom Transformer-based language model trained from scratch on Tamazight (Berber) text with a SentencePiece BPE tokenizer.
Model Details
Architecture
- Model Type: Transformer-based Language Model
- Parameters: 98,299,392
- Hidden Size: 568
- Attention Heads: 12
- Layers: 12
- Maximum Sequence Length: 256
- Vocabulary Size: 8,000
Tokenizer
- Type: SentencePiece BPE (Byte-Pair Encoding)
- Vocabulary Size: 8,000
- Special Tokens: <s>, </s>, <unk>, <pad>
- Scripts Supported: Tifinagh, Latin, Arabic
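To illustrate how a BPE tokenizer like this one builds its subword vocabulary, here is a minimal, self-contained sketch of the classic merge loop. It is a toy illustration only, not the actual SentencePiece training code; the corpus and merge count are made up:

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen pair with its merged symbol
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# Toy corpus of space-separated characters
# (hypothetical Tamazight-like words in Latin script)
vocab = {"a s a l a m": 5, "a s a m": 2, "s a l": 3}
merges = []
for _ in range(3):  # learn 3 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
```

Repeating this loop until the vocabulary reaches 8,000 symbols yields the merge table a BPE tokenizer applies at encoding time.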
Usage
Loading the Model
```python
import sentencepiece as spm
import torch

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# Load model weights; for a .bin file this typically returns a state_dict
# that must be loaded into the model class before use
model = torch.load('pytorch_model.bin', map_location='cpu')
```
Tokenizing Text
```python
# Encode
text = "Asalam"
pieces = sp.encode_as_pieces(text)  # subword pieces
ids = sp.encode_as_ids(text)        # integer token ids

# Decode
decoded = sp.decode_ids(ids)
```
Files
- pytorch_model.bin: Model weights
- config.json: Model configuration
- tokenizer.model: SentencePiece BPE tokenizer
- tokenizer_config.json: Tokenizer configuration
- special_tokens_map.json: Special tokens mapping
Languages
Supports Tamazight text in multiple scripts:
- Tifinagh: Traditional Amazigh script
- Latin: Latin transliteration
- Arabic: Arabic script representation
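Because input text may arrive in any of these three scripts, it can be useful to detect which script a string uses before tokenizing. Below is a minimal sketch using Python's standard unicodedata module; the helper name detect_script is illustrative and not part of this repository:

```python
import unicodedata

def detect_script(text):
    # Tally letters per script by inspecting each character's Unicode name
    counts = {"Tifinagh": 0, "Latin": 0, "Arabic": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("TIFINAGH"):
            counts["Tifinagh"] += 1
        elif name.startswith("ARABIC"):
            counts["Arabic"] += 1
        elif "LATIN" in name:
            counts["Latin"] += 1
    # Return the dominant script, or None if no letters were seen
    return max(counts, key=counts.get) if any(counts.values()) else None
```

For example, detect_script("ⴰⵣⵓⵍ") returns "Tifinagh", while detect_script("Azul") returns "Latin".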
Created on: 2026-02-07