Tamazight Language Model with SentencePiece BPE Tokenizer

This is a custom Transformer-based language model, trained from scratch on Tamazight (Berber) text, that uses a SentencePiece BPE tokenizer.

Model Details

Architecture

  • Model Type: Transformer-based Language Model
  • Parameters: 98,299,392
  • Hidden Size: 568
  • Attention Heads: 12
  • Layers: 12
  • Maximum Sequence Length: 256
  • Vocabulary Size: 8,000

Tokenizer

  • Type: SentencePiece BPE (Byte-Pair Encoding)
  • Vocabulary Size: 8,000
  • Special Tokens: <s>, </s>, <unk>, <pad>
  • Scripts Supported: Tifinagh, Latin, Arabic

Usage

Loading the Model

import sentencepiece as spm
import torch

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# Load model weights (a .bin checkpoint usually holds a state_dict;
# instantiate the model class and call load_state_dict to restore it)
state_dict = torch.load('pytorch_model.bin', map_location='cpu')

Tokenizing Text
# Encode
text = "Asalam"
pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)

# Decode
decoded = sp.decode_ids(ids)

Files

  • pytorch_model.bin: Model weights
  • config.json: Model configuration
  • tokenizer.model: SentencePiece BPE tokenizer
  • tokenizer_config.json: Tokenizer configuration
  • special_tokens_map.json: Special tokens mapping
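The exact keys inside config.json are not documented here, so the snippet below is a hypothetical sketch: it writes and reads back a config mirroring the hyperparameters from the Architecture section, using Hugging Face-style key names as an assumption.

```python
# Hypothetical config.json round-trip; key names are assumed, values are
# taken from the Architecture section of this card.
import json, os, tempfile

config = {
    "hidden_size": 568,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "max_position_embeddings": 256,
    "vocab_size": 8000,
}

path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

with open(path) as f:
    loaded = json.load(f)   # round-trips to the same dict
```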

Languages
Supports Tamazight text in multiple scripts:

  • Tifinagh: Traditional Amazigh script
  • Latin: Latin transliteration
  • Arabic: Arabic script representation
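SentencePiece operates on raw Unicode, so all three scripts pass through the same tokenization pipeline. A quick way to see the script distinction is via the Unicode database; the sample words below (and the Arabic-script spelling in particular) are illustrative, not drawn from the training data.

```python
# Tifinagh occupies its own Unicode block (U+2D30..U+2D7F), so script
# membership can be checked character by character.
import unicodedata

tifinagh = "ⴰⵣⵓⵍ"   # "azul" (hello) in Tifinagh
latin = "azul"       # Latin transliteration
arabic = "أزول"      # illustrative Arabic-script spelling

all_tifinagh = all(
    unicodedata.name(c).startswith("TIFINAGH") for c in tifinagh
)
in_block = all(0x2D30 <= ord(c) <= 0x2D7F for c in tifinagh)
```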

Created on: 2026-02-07