Pidgin14 Encoder (AfriBERTa-based)

Overview

This repository hosts the encoder-side tokenizer for pidgin14, an encoder-decoder sequence-to-sequence system for Nigerian Pidgin English ("Naija") built by Ephraim at Analytics Intelligence.

pidgin14 is composed of two halves published as separate repositories:

Encoder (this repo) — based on AfriBERTa, reads source text and produces contextual representations.
Decoder — Ephraimmm/pidgin14-decoder, based on GPT-2-medium, consumes the encoder's representations via cross-attention and generates output text.

The two halves are combined and trained together as a single EncoderDecoderModel, whose full weights are published at Ephraimmm/pidgin14. This component repository contains only tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model) and no standalone config.json or weight file, so the architecture facts below are taken directly from the encoder sub-config in that combined model's config.json.

Architecture Details

From the encoder sub-configuration of the combined Ephraimmm/pidgin14 model:

Field	Value
Base model	`castorini/afriberta_small`
Model type	`xlm-roberta` (architecture class `XLMRobertaForMaskedLM`, used as the encoder half of an `EncoderDecoderModel`)
Hidden size	768
Hidden layers	4
Attention heads	6
Intermediate (feed-forward) size	3072
Max position embeddings	514
Hidden activation	GELU
Vocabulary size	70,006
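
As a rough sanity check on the table above, the sketch below derives the per-head dimension and an approximate encoder parameter count from the listed values. It assumes the usual XLM-RoBERTa layout (biases on every linear layer, two LayerNorms per block, a single token-type embedding row); none of these details are stated in this card. It ignores the masked-LM head, the pooler, and everything on the decoder side, so treat the result as an estimate rather than an official figure.

```python
# Values copied from the architecture table above.
HIDDEN = 768
LAYERS = 4
HEADS = 6
INTERMEDIATE = 3072
MAX_POSITIONS = 514
VOCAB = 70_006

# Assumption (not stated in the card): one token-type embedding row,
# as in the usual XLM-RoBERTa layout.
TYPE_VOCAB = 1


def head_dim(hidden=HIDDEN, heads=HEADS):
    """Width of each attention head; hidden size must divide evenly."""
    if hidden % heads != 0:
        raise ValueError("hidden size is not divisible by the head count")
    return hidden // heads


def layer_norm_params(hidden=HIDDEN):
    """A LayerNorm has one scale and one shift vector."""
    return 2 * hidden


def embedding_params():
    """Token, position and token-type embeddings plus their LayerNorm."""
    rows = VOCAB + MAX_POSITIONS + TYPE_VOCAB
    return rows * HIDDEN + layer_norm_params()


def layer_params():
    """One transformer block: self-attention, feed-forward, two LayerNorms."""
    attention = 4 * (HIDDEN * HIDDEN + HIDDEN)  # Q, K, V and output projections
    feed_forward = (HIDDEN * INTERMEDIATE + INTERMEDIATE) + (INTERMEDIATE * HIDDEN + HIDDEN)
    return attention + feed_forward + 2 * layer_norm_params()


def encoder_params():
    """Embeddings plus all blocks (no LM head, no pooler)."""
    return embedding_params() + LAYERS * layer_params()


if __name__ == "__main__":
    total = encoder_params()
    print(f"head dim: {head_dim()}")
    print(f"approx. encoder parameters: {total:,}")
    # float32 stores 4 bytes per parameter
    print(f"approx. float32 size: {total * 4 / 2**20:.1f} MiB")
```

Under these assumptions the token embedding matrix accounts for well over half of the encoder's parameters, which is typical for a shallow model with a large multilingual vocabulary.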

Tokenizer shipped in this repository:

Tokenizer class: XLMRobertaTokenizer
Underlying algorithm: SentencePiece Unigram model (sentencepiece.bpe.model)
Vocabulary size: 70,006 tokens
Special tokens: <s> (bos/cls), <pad>, </s> (eos/sep), <unk>, <mask>
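
To illustrate how the special tokens listed above frame an input, here is a minimal sketch of the standard XLM-RoBERTa convention: a single sequence becomes `<s> A </s>`, and a pair becomes `<s> A </s></s> B </s>`. It is assumed, not verified against this repository's files, that this tokenizer follows the default convention. The subword pieces in the example are hypothetical; the real segmentation may differ.

```python
# Special-token strings as listed in this card.
BOS = "<s>"   # also serves as the cls token
EOS = "</s>"  # also serves as the sep token


def frame_single(pieces):
    """Wrap one tokenized sequence: <s> A </s>."""
    return [BOS, *pieces, EOS]


def frame_pair(pieces_a, pieces_b):
    """Wrap a sequence pair: <s> A </s></s> B </s>."""
    return [BOS, *pieces_a, EOS, EOS, *pieces_b, EOS]


if __name__ == "__main__":
    # Hypothetical subword pieces, for illustration only.
    example = ["▁How", "▁you", "▁dey", "?"]
    print(frame_single(example))
```

In practice the tokenizer applies this framing itself when called with its default `add_special_tokens=True`; the helper only makes the layout visible.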

Training Details

Fine-tuned from: castorini/afriberta_small, used as the encoder half of the pidgin14 EncoderDecoderModel.
Framework: Hugging Face transformers (the combined model's config records transformers_version: 4.44.2).
Stored precision: float32 (per the combined model's config).
No trainer_state.json, training-step/epoch counts, optimizer settings, or training-dataset identifiers are published in this repository or in the combined Ephraimmm/pidgin14 repository. These details are therefore omitted rather than estimated.

Intended Use

Encoding Nigerian Pidgin English and/or English text as the first stage of the pidgin14 sequence-to-sequence pipeline (e.g. translation, paraphrasing, conversational response generation).
Research and experimentation on low-resource West African language NLP.
Must be paired with the pidgin14-decoder tokenizer and the trained weights in Ephraimmm/pidgin14 to produce output.

How to Use

from transformers import AutoTokenizer, EncoderDecoderModel

# Tokenizers for each half of the system
encoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-encoder")
decoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-decoder")

# The trained combined encoder-decoder weights
model = EncoderDecoderModel.from_pretrained("Ephraimmm/pidgin14")

text = "How you dey?"
inputs = encoder_tokenizer(text, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    decoder_start_token_id=decoder_tokenizer.bos_token_id,
    max_length=50,
)
print(decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True))
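
The snippet above handles one sentence. A batched variant might look like the sketch below. The helper name `generate_batch` and its `max_source_length` parameter are hypothetical and not part of this repository. The default of 512 assumes the usual RoBERTa-style position offset, under which 514 position embeddings leave 512 usable token positions. The tokenizers and model are passed in as arguments, so the function itself needs no imports.

```python
def generate_batch(texts, encoder_tokenizer, decoder_tokenizer, model,
                   max_source_length=512, max_length=50):
    """Encode several source strings, generate, and decode the outputs.

    `max_source_length` is an assumed safe cap, not a documented setting.
    """
    # Pad to the longest sentence in the batch and truncate overly long inputs.
    inputs = encoder_tokenizer(
        list(texts),
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_source_length,
    )
    # Same generation arguments as the single-sentence example above.
    output_ids = model.generate(
        **inputs,
        decoder_start_token_id=decoder_tokenizer.bos_token_id,
        max_length=max_length,
    )
    # One decoded string per input text.
    return decoder_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the objects loaded as in the example above, a call would be `generate_batch(["How you dey?", "Wetin dey happen?"], encoder_tokenizer, decoder_tokenizer, model)`. Padding produces an attention mask, which `generate` receives through `**inputs`.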

Limitations

This repository provides only the tokenizer for the encoder half of pidgin14; it is not a usable standalone model and contains no weight file or config.json of its own.
Must be paired with Ephraimmm/pidgin14-decoder and the weights in Ephraimmm/pidgin14 to perform any task.
Nigerian Pidgin English is a low-resource language with substantial dialectal and orthographic variation; a fixed AfriBERTa-derived vocabulary may not fully capture all spelling variants encountered in real usage.
No evaluation metrics, benchmark results, or training-dataset documentation are published for this model. Outputs should be independently validated before any production use.
License terms are not specified in the repository; users should contact the author before commercial reuse.
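
One way to probe the vocabulary-coverage concern noted above is to measure how heavily spelling variants are fragmented and how often they map to the unknown token. The sketch below is a hypothetical helper, not part of this repository. It takes any callable that maps a word to token ids, so it can be tried with the encoder tokenizer or with a stand-in.

```python
def fragmentation_stats(words, encode, unk_id):
    """Return average pieces per word and the share of unknown-token pieces.

    `encode` is any callable mapping a string to a list of token ids
    (without special tokens); `unk_id` is the id of the unknown token.
    """
    total_pieces = 0
    unk_pieces = 0
    for word in words:
        ids = encode(word)
        total_pieces += len(ids)
        unk_pieces += sum(1 for token_id in ids if token_id == unk_id)
    word_count = len(words)
    return {
        "pieces_per_word": total_pieces / word_count if word_count else 0.0,
        "unk_rate": unk_pieces / total_pieces if total_pieces else 0.0,
    }
```

With the encoder tokenizer loaded as in How to Use, a call could pass `lambda w: encoder_tokenizer.encode(w, add_special_tokens=False)` as `encode` and `encoder_tokenizer.unk_token_id` as `unk_id`, over a list of spelling variants chosen by the user. A high pieces-per-word value for a variant suggests the vocabulary covers that spelling poorly, though the statistic says nothing on its own about downstream output quality.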

Author

Developed by Ephraimmm
