
Pidgin14 Encoder (AfriBERTa-based)

Overview

This repository hosts the encoder-side tokenizer for pidgin14, an encoder-decoder sequence-to-sequence system for Nigerian Pidgin English ("Naija") built by Ephraim at Analytics Intelligence.

pidgin14 is composed of two halves published as separate repositories:

  • Encoder (this repo): based on AfriBERTa, reads source text and produces contextual representations.
  • Decoder: Ephraimmm/pidgin14-decoder, based on GPT-2-medium, consumes the encoder's representations via cross-attention and generates output text.

The two halves are combined and trained together as a single EncoderDecoderModel, whose full weights are published at Ephraimmm/pidgin14. The architecture facts below are taken directly from that combined model's config.json (encoder sub-config), since this component repository itself contains only tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model) and not a standalone config.json or weight file.
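As a minimal sketch of how the encoder facts can be read out of a combined config, the snippet below pulls a few fields from the "encoder" key of a config dictionary. The helper names (summarize_encoder, load_config) are hypothetical, and the key names follow the usual transformers config.json conventions; that the published file uses exactly these keys is an assumption. The example dictionary is a stand-in built from the values reported in this card, not a copy of the real file.

```python
import json

# Key names follow common Hugging Face config.json conventions (assumption).
ENCODER_FIELDS = (
    "model_type",
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "intermediate_size",
    "max_position_embeddings",
    "hidden_act",
    "vocab_size",
)


def summarize_encoder(config: dict) -> dict:
    """Return selected fields from the "encoder" sub-config of a combined config."""
    encoder = config["encoder"]
    return {name: encoder.get(name) for name in ENCODER_FIELDS}


def load_config(path: str) -> dict:
    """Read a locally downloaded config.json (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in for the combined config, filled with the values quoted in this card.
example_config = {
    "encoder": {
        "model_type": "xlm-roberta",
        "hidden_size": 768,
        "num_hidden_layers": 4,
        "num_attention_heads": 6,
        "intermediate_size": 3072,
        "max_position_embeddings": 514,
        "hidden_act": "gelu",
        "vocab_size": 70006,
    },
    "decoder": {"model_type": "gpt2"},
}

summary = summarize_encoder(example_config)
for name, value in summary.items():
    print(f"{name}: {value}")
```

To run this against the real file, download config.json from Ephraimmm/pidgin14 and pass its path to load_config before calling summarize_encoder.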

Architecture Details

From the encoder sub-configuration of the combined Ephraimmm/pidgin14 model:

| Field | Value |
| --- | --- |
| Base model | castorini/afriberta_small |
| Model type | xlm-roberta (architecture class XLMRobertaForMaskedLM, used as the encoder half of an EncoderDecoderModel) |
| Hidden size | 768 |
| Hidden layers | 4 |
| Attention heads | 6 |
| Intermediate (feed-forward) size | 3072 |
| Max position embeddings | 514 |
| Hidden activation | GELU |
| Vocabulary size | 70,006 |
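The figures above imply a few derived quantities. The following is a rough back-of-the-envelope sketch, assuming a standard BERT-style layer layout (four attention projections with biases, a two-layer feed-forward block, two LayerNorms per layer). It counts only the token embeddings and the transformer layers, so it omits position embeddings, the embedding LayerNorm and any language-modeling head, and it should be read as an estimate rather than the model's actual parameter count.

```python
# Values copied from the architecture table in this card.
hidden_size = 768
num_layers = 4
num_heads = 6
intermediate_size = 3072
vocab_size = 70_006

# Each attention head works on an equal slice of the hidden dimension.
head_dim = hidden_size // num_heads

# Token embedding matrix: one hidden-size vector per vocabulary entry.
token_embedding_params = vocab_size * hidden_size

# Query, key, value and output projections, each with a bias vector.
attention_params = 4 * (hidden_size * hidden_size + hidden_size)

# Feed-forward block: hidden -> intermediate -> hidden, with biases.
ffn_params = (hidden_size * intermediate_size + intermediate_size) + (
    intermediate_size * hidden_size + hidden_size
)

# Two LayerNorms per layer, each with a weight and a bias vector.
layer_norm_params = 2 * 2 * hidden_size

per_layer_params = attention_params + ffn_params + layer_norm_params
partial_total = token_embedding_params + num_layers * per_layer_params

# float32 stores each parameter in 4 bytes.
approx_megabytes = partial_total * 4 / 1_000_000

print(f"head dimension:         {head_dim}")
print(f"token embedding params: {token_embedding_params:,}")
print(f"params per layer:       {per_layer_params:,}")
print(f"partial total:          {partial_total:,}")
print(f"approx. float32 size:   {approx_megabytes:.1f} MB")
```

With a 70,006-entry vocabulary and only four layers, the token embedding matrix accounts for most of the counted parameters.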

Tokenizer shipped in this repository:

  • Tokenizer class: XLMRobertaTokenizer
  • Underlying algorithm: SentencePiece Unigram model (sentencepiece.bpe.model)
  • Vocabulary size: 70,006 tokens
  • Special tokens: <s> (bos/cls), <pad>, </s> (eos/sep), <unk>, <mask>
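To illustrate the roles of the special tokens, the sketch below wraps already-segmented pieces in the template that XLM-RoBERTa tokenizers in transformers use for single sequences and for sequence pairs. It works on plain strings and does not load the tokenizer. The piece segmentation shown is hypothetical, and whether this repository's tokenizer configuration keeps the default template is an assumption.

```python
BOS = "<s>"   # doubles as the classification (cls) token
EOS = "</s>"  # doubles as the separator (sep) token


def wrap_single(pieces: list[str]) -> list[str]:
    """Single sequence: <s> A </s>."""
    return [BOS, *pieces, EOS]


def wrap_pair(first: list[str], second: list[str]) -> list[str]:
    """Sequence pair: <s> A </s></s> B </s>."""
    return [BOS, *first, EOS, EOS, *second, EOS]


# Hypothetical SentencePiece segmentation, for illustration only.
pieces = ["▁How", "▁you", "▁dey", "?"]
print(wrap_single(pieces))
```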

Training Details

  • Fine-tuned from: castorini/afriberta_small, used as the encoder half of the pidgin14 EncoderDecoderModel.
  • Framework: Hugging Face transformers (the combined model's config records transformers_version: 4.44.2).
  • Stored precision: float32 (per the combined model's config).
  • No trainer_state.json, training-step/epoch counts, optimizer settings, or training-dataset identifiers are published in this repository or in the combined Ephraimmm/pidgin14 repository. These details are therefore omitted rather than estimated.

Intended Use

  • Encoding Nigerian Pidgin English and/or English text as the first stage of the pidgin14 sequence-to-sequence pipeline (e.g. translation, paraphrasing, conversational response generation).
  • Research and experimentation on low-resource West African language NLP.
  • Must be paired with the pidgin14-decoder tokenizer and the trained weights in Ephraimmm/pidgin14 to produce output.

How to Use

from transformers import AutoTokenizer, EncoderDecoderModel

# Tokenizers for each half of the system
encoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-encoder")
decoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-decoder")

# The trained combined encoder-decoder weights
model = EncoderDecoderModel.from_pretrained("Ephraimmm/pidgin14")

text = "How you dey?"
inputs = encoder_tokenizer(text, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    decoder_start_token_id=decoder_tokenizer.bos_token_id,
    max_length=50,
)
print(decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True))
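Long inputs may exceed the encoder's position limit. In RoBERTa-style models, position indices start after the padding index, so a max_position_embeddings of 514 leaves 512 usable token positions. The sketch below truncates to that length when tokenizing. The helper names are hypothetical, and whether this repository's tokenizer_config.json already sets a suitable model_max_length is not confirmed here, so the limit is passed explicitly.

```python
def usable_length(max_position_embeddings: int = 514, reserved: int = 2) -> int:
    """Token positions available after the two slots reserved by the padding offset."""
    return max_position_embeddings - reserved


def encode_for_encoder(tokenizer, text: str, max_position_embeddings: int = 514):
    """Tokenize text, truncating so it fits the encoder's position embeddings."""
    return tokenizer(
        text,
        truncation=True,
        max_length=usable_length(max_position_embeddings),
        return_tensors="pt",
    )
```

In the example above, this would replace the direct tokenizer call: inputs = encode_for_encoder(encoder_tokenizer, text).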

Limitations

  • This repository provides only the tokenizer for the encoder half of pidgin14; it is not a usable standalone model and contains no weight file or config.json of its own.
  • Must be paired with Ephraimmm/pidgin14-decoder and the weights in Ephraimmm/pidgin14 to perform any task.
  • Nigerian Pidgin English is a low-resource language with substantial dialectal and orthographic variation; a fixed AfriBERTa-derived vocabulary may not fully capture all spelling variants encountered in real usage.
  • No evaluation metrics, benchmark results, or training-dataset documentation are published for this model. Outputs should be independently validated before any production use.
  • License terms are not specified in the repository; users should contact the author before commercial reuse.

Author

Developed by Ephraimmm
