Pidgin14 Encoder (AfriBERTa-based)
Overview
This repository hosts the encoder-side tokenizer for pidgin14, an encoder-decoder sequence-to-sequence system for Nigerian Pidgin English ("Naija") built by Ephraim at Analytics Intelligence.
pidgin14 is composed of two halves published as separate repositories:
- Encoder (this repo): based on AfriBERTa, reads source text and produces contextual representations.
- Decoder: Ephraimmm/pidgin14-decoder, based on GPT-2-medium, consumes the encoder's representations via cross-attention and generates output text.
The two halves are combined and trained together as a single EncoderDecoderModel, whose full weights are published at Ephraimmm/pidgin14. The architecture facts below are taken directly from that combined model's config.json (encoder sub-config), since this component repository itself contains only tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model) and not a standalone config.json or weight file.
Architecture Details
From the encoder sub-configuration of the combined Ephraimmm/pidgin14 model:
| Field | Value |
|---|---|
| Base model | castorini/afriberta_small |
| Model type | xlm-roberta (architecture class XLMRobertaForMaskedLM, used as the encoder half of an EncoderDecoderModel) |
| Hidden size | 768 |
| Hidden layers | 4 |
| Attention heads | 6 |
| Intermediate (feed-forward) size | 3072 |
| Max position embeddings | 514 |
| Hidden activation | GELU |
| Vocabulary size | 70,006 |
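As a quick sanity check on the table above, the following sketch derives the per-head dimension and a rough parameter count from the listed values. The counting formula is an assumption: it presumes a standard BERT/RoBERTa-style layer layout (a bias on every linear layer, two LayerNorms per layer) and ignores token-type embeddings, the embedding LayerNorm, and any language-model head. Treat the total as an estimate, not as a figure published in the repository.

```python
# Values copied from the architecture table above.
HIDDEN_SIZE = 768
NUM_LAYERS = 4
NUM_HEADS = 6
INTERMEDIATE_SIZE = 3072
MAX_POSITIONS = 514
VOCAB_SIZE = 70_006

# Each attention head works on an equal slice of the hidden vector.
head_dim = HIDDEN_SIZE // NUM_HEADS


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of one dense layer."""
    return n_in * n_out + n_out


def layer_params() -> int:
    """Assumed layout: Q, K, V and output projections, a 2-layer FFN, 2 LayerNorms."""
    attention = 4 * linear_params(HIDDEN_SIZE, HIDDEN_SIZE)
    ffn = (linear_params(HIDDEN_SIZE, INTERMEDIATE_SIZE)
           + linear_params(INTERMEDIATE_SIZE, HIDDEN_SIZE))
    layer_norms = 2 * (2 * HIDDEN_SIZE)  # scale and shift for each LayerNorm
    return attention + ffn + layer_norms


token_embedding_params = VOCAB_SIZE * HIDDEN_SIZE
position_embedding_params = MAX_POSITIONS * HIDDEN_SIZE
encoder_stack_params = NUM_LAYERS * layer_params()
approx_total = (token_embedding_params
                + position_embedding_params
                + encoder_stack_params)

print(f"head_dim         = {head_dim}")
print(f"token embeddings ~ {token_embedding_params:,}")
print(f"encoder layers   ~ {encoder_stack_params:,}")
print(f"approx total     ~ {approx_total:,}")
```

Under these assumptions the token embedding matrix accounts for well over half of the count, which is what one would expect from a shallow model with a large vocabulary.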
Tokenizer shipped in this repository:
- Tokenizer class: XLMRobertaTokenizer
- Underlying algorithm: SentencePiece Unigram model (sentencepiece.bpe.model)
- Vocabulary size: 70,006 tokens
- Special tokens: <s> (bos/cls), <pad>, </s> (eos/sep), <unk>, <mask>
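To make the role of the special tokens concrete, here is a small pure-Python sketch of how an XLM-RoBERTa-style tokenizer frames a single sequence: `<s>` at the start, `</s>` at the end, and `<pad>` on the right when padding is requested. The helper names `frame_tokens` and `attention_mask` and the sample pieces are hypothetical illustrations. The real tokenizer works on token ids and produces the subword pieces itself (with a word-boundary marker that is omitted here).

```python
from typing import List

BOS, EOS, PAD = "<s>", "</s>", "<pad>"


def frame_tokens(pieces: List[str], pad_to: int = 0) -> List[str]:
    """Wrap subword pieces as <s> ... </s>, then right-pad up to `pad_to`."""
    framed = [BOS, *pieces, EOS]
    padding = [PAD] * max(0, pad_to - len(framed))
    return framed + padding


def attention_mask(framed: List[str]) -> List[int]:
    """1 for real tokens (including <s> and </s>), 0 for padding."""
    return [0 if tok == PAD else 1 for tok in framed]


# Hypothetical pieces for "How you dey?"; the real segmentation may differ.
pieces = ["How", "you", "dey", "?"]
framed = frame_tokens(pieces, pad_to=8)
print(framed)
print(attention_mask(framed))
```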
Training Details
- Fine-tuned from: castorini/afriberta_small, used as the encoder half of the pidgin14 EncoderDecoderModel.
- Framework: Hugging Face transformers (the combined model's config records transformers_version: 4.44.2).
- Stored precision: float32 (per the combined model's config).
- No trainer_state.json, training-step/epoch counts, optimizer settings, or training-dataset identifiers are published in this repository or in the combined Ephraimmm/pidgin14 repository. These details are therefore omitted rather than estimated.
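Because the facts above come from the combined model's config.json, a reader may want to check them programmatically. The sketch below parses an abbreviated, hand-written JSON fragment that mirrors the values quoted in this card and reads the encoder sub-config from it. The fragment is not a copy of the actual file: the real config contains many more keys, and the key names used here (`torch_dtype`, `num_hidden_layers`, and so on) are assumed to follow the usual transformers conventions. `summarize_encoder` is a hypothetical helper name.

```python
import json

# Abbreviated stand-in for the combined model's config.json, rebuilt from the
# values quoted in this card. Assumption: key names follow the usual
# transformers conventions; the real file has more keys than shown here.
CONFIG_TEXT = """
{
  "model_type": "encoder-decoder",
  "transformers_version": "4.44.2",
  "torch_dtype": "float32",
  "encoder": {
    "model_type": "xlm-roberta",
    "hidden_size": 768,
    "num_hidden_layers": 4,
    "num_attention_heads": 6,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "hidden_act": "gelu",
    "vocab_size": 70006
  }
}
"""


def summarize_encoder(config: dict) -> dict:
    """Pick out the encoder facts quoted in the card from a parsed config."""
    enc = config["encoder"]
    return {
        "model_type": enc["model_type"],
        "layers": enc["num_hidden_layers"],
        "heads": enc["num_attention_heads"],
        "vocab_size": enc["vocab_size"],
        "dtype": config.get("torch_dtype", "unknown"),
        "transformers_version": config.get("transformers_version", "unknown"),
    }


summary = summarize_encoder(json.loads(CONFIG_TEXT))
for key, value in summary.items():
    print(f"{key}: {value}")
```

To run the same check against the published file, replace `CONFIG_TEXT` with the contents of the downloaded config.json.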
Intended Use
- Encoding Nigerian Pidgin English and/or English text as the first stage of the pidgin14 sequence-to-sequence pipeline (e.g. translation, paraphrasing, conversational response generation).
- Research and experimentation on low-resource West African language NLP.
- Must be paired with the pidgin14-decoder tokenizer and the trained weights in Ephraimmm/pidgin14 to produce output.
How to Use

    from transformers import AutoTokenizer, EncoderDecoderModel

    # Tokenizers for each half of the system
    encoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-encoder")
    decoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-decoder")

    # The trained combined encoder-decoder weights
    model = EncoderDecoderModel.from_pretrained("Ephraimmm/pidgin14")

    text = "How you dey?"
    inputs = encoder_tokenizer(text, return_tensors="pt")

    output_ids = model.generate(
        **inputs,
        decoder_start_token_id=decoder_tokenizer.bos_token_id,
        max_length=50,
    )
    print(decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True))

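The same steps can be wrapped in a reusable helper. The sketch below is one possible arrangement, not the author's code. The function names `load_pidgin14` and `generate_text` are hypothetical. The 512-token truncation limit is an assumption derived from the 514 position embeddings (RoBERTa-style models reserve two positions). Reusing the decoder's end-of-sequence token for padding is likewise an assumption, commonly made for GPT-2 tokenizers because they ship without a pad token. `max_new_tokens` is used in place of `max_length` so that the limit counts only generated tokens.

```python
def load_pidgin14():
    """Download both tokenizers and the combined model (needs network access)."""
    # Imported lazily so the helper below can be used with any compatible objects.
    from transformers import AutoTokenizer, EncoderDecoderModel

    encoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-encoder")
    decoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-decoder")
    model = EncoderDecoderModel.from_pretrained("Ephraimmm/pidgin14")
    model.eval()  # inference mode: disables dropout
    return encoder_tokenizer, decoder_tokenizer, model


def generate_text(text, encoder_tokenizer, decoder_tokenizer, model, max_new_tokens=50):
    """Encode `text`, generate with the combined model, and decode the result."""
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,  # assumption: 514 positions minus 2 reserved ones
    )
    output_ids = model.generate(
        **inputs,
        decoder_start_token_id=decoder_tokenizer.bos_token_id,
        pad_token_id=decoder_tokenizer.eos_token_id,  # assumption, see lead-in
        max_new_tokens=max_new_tokens,
    )
    return decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (downloads the model, so it is left commented out here):
# enc_tok, dec_tok, model = load_pidgin14()
# print(generate_text("How you dey?", enc_tok, dec_tok, model))
```

Keeping the loading step separate from generation means the model is downloaded once and can then be reused across many calls.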
Limitations
- This repository provides only the tokenizer for the encoder half of pidgin14; it is not a usable standalone model and contains no weight file or config.json of its own.
- Must be paired with Ephraimmm/pidgin14-decoder and the weights in Ephraimmm/pidgin14 to perform any task.
- Nigerian Pidgin English is a low-resource language with substantial dialectal and orthographic variation; a fixed AfriBERTa-derived vocabulary may not fully capture all spelling variants encountered in real usage.
- No evaluation metrics, benchmark results, or training-dataset documentation are published for this model. Outputs should be independently validated before any production use.
- License terms are not specified in the repository; users should contact the author before commercial reuse.
Author
Developed by Ephraimmm