Pidgin14 Decoder (GPT-2-medium-based)

Overview

This repository hosts the decoder-side tokenizer for pidgin14, an encoder-decoder sequence-to-sequence system for Nigerian Pidgin English ("Naija") built by Ephraim at Analytics Intelligence.

pidgin14 is composed of two halves published as separate repositories:

Encoder — Ephraimmm/pidgin14-encoder, based on AfriBERTa, reads source text and produces contextual representations.
Decoder (this repo) — based on GPT-2-medium, consumes the encoder's representations via cross-attention and generates the output text.

The two halves are combined and trained together as a single EncoderDecoderModel, whose full weights are published at Ephraimmm/pidgin14. This component repository contains only tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.json, merges.txt); it has no config.json or weight file of its own. The architecture facts below are therefore taken from the decoder sub-config in the combined model's config.json.

Architecture Details

From the decoder sub-configuration of the combined Ephraimmm/pidgin14 model:

Field	Value
Base model	`gpt2-medium`
Model type	`gpt2` (architecture class `GPT2LMHeadModel`), configured with `add_cross_attention: true` so it can act as the decoder half of an `EncoderDecoderModel`
Layers (`n_layer`)	24
Hidden size (`n_embd`)	1024
Attention heads (`n_head`)	16
Context length (`n_positions` / `n_ctx`)	1024
Vocabulary size	50,257
Activation function	`gelu_new`
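
As a sanity check on the figures above, the following sketch derives the per-head dimension and an approximate parameter count from the table values alone. Several things here are assumptions based on the standard Hugging Face GPT-2 implementation and are not read from this repository: the 4x MLP expansion, the output head being tied to the token embeddings, and the cross-attention layout (a query projection, a fused key/value projection, an output projection, and one extra LayerNorm per block, all at the decoder's hidden width).

```python
# Values copied from the table above.
N_LAYER, N_EMBD, N_HEAD = 24, 1024, 16
N_POSITIONS, VOCAB_SIZE = 1024, 50257


def linear(n_in: int, n_out: int) -> int:
    """Parameters in a dense projection with bias."""
    return n_in * n_out + n_out


def layer_norm(width: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * width


def base_block(d: int) -> int:
    """One GPT-2 block without cross-attention (assumes a 4x MLP)."""
    self_attn = linear(d, 3 * d) + linear(d, d)  # fused QKV + output projection
    mlp = linear(d, 4 * d) + linear(4 * d, d)
    return 2 * layer_norm(d) + self_attn + mlp


def cross_attention(d: int) -> int:
    """Extra parameters per block with add_cross_attention (assumed layout)."""
    return linear(d, d) + linear(d, 2 * d) + linear(d, d) + layer_norm(d)


head_dim = N_EMBD // N_HEAD
# Token + position embeddings; the LM head is assumed tied to the token embeddings.
embeddings = VOCAB_SIZE * N_EMBD + N_POSITIONS * N_EMBD
base_params = embeddings + N_LAYER * base_block(N_EMBD) + layer_norm(N_EMBD)
cross_params = N_LAYER * cross_attention(N_EMBD)

print(f"head dimension:             {head_dim}")
print(f"decoder without cross-attn: {base_params:,}")
print(f"cross-attention additions:  {cross_params:,}")
```

Under these assumptions the decoder without cross-attention comes to 354,823,168 parameters (about 355M), which is consistent with gpt2-medium. The cross-attention figure is an estimate only and excludes any projection between encoder and decoder hidden sizes.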

Tokenizer shipped in this repository:

Tokenizer class: GPT2Tokenizer (byte-level BPE)
Vocabulary size: 50,257 tokens in vocab.json, with 50,000 merge rules in merges.txt. This matches the standard, unmodified GPT-2 tokenizer vocabulary; it is not a Pidgin-specific retrained vocabulary.
Special token: <|endoftext|> used as bos/eos/pad/unk (token id 50256).
decoder_start_token_id: 50256 (per the combined model's config).
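
The vocabulary size follows from how byte-level BPE builds its token set: 256 single-byte base tokens, one new token per merge rule, and the special token. A small check of that arithmetic, using only the figures listed above:

```python
BYTE_TOKENS = 256      # one base token per possible byte value
MERGE_RULES = 50_000   # merge rules listed for merges.txt
SPECIAL_TOKENS = 1     # <|endoftext|>

vocab_size = BYTE_TOKENS + MERGE_RULES + SPECIAL_TOKENS
# Token ids are zero-based, so the last id is vocab_size - 1,
# which is the id listed for <|endoftext|>.
endoftext_id = vocab_size - 1

print(vocab_size, endoftext_id)
```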

Training Details

Fine-tuned from: gpt2-medium, used as the decoder half of the pidgin14 EncoderDecoderModel (with cross-attention layers added to attend to the encoder's outputs).
Framework: Hugging Face transformers (the combined model's config records transformers_version: 4.44.2).
Stored precision: float32 (per the combined model's config).
No trainer_state.json, training-step/epoch counts, optimizer settings, or training-dataset identifiers are published in this repository or in the combined Ephraimmm/pidgin14 repository. These details are therefore omitted rather than estimated.

Intended Use

Generating Nigerian Pidgin English and/or English text as the second stage of the pidgin14 sequence-to-sequence pipeline (e.g. translation, paraphrasing, conversational response generation).
Research and experimentation on low-resource West African language NLP.
This tokenizer must be paired with the pidgin14-encoder tokenizer and the trained weights in Ephraimmm/pidgin14 to produce output.

How to Use

from transformers import AutoTokenizer, EncoderDecoderModel

# Tokenizers for each half of the system
encoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-encoder")
decoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-decoder")

# The trained combined encoder-decoder weights
model = EncoderDecoderModel.from_pretrained("Ephraimmm/pidgin14")

text = "How you dey?"
inputs = encoder_tokenizer(text, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    decoder_start_token_id=decoder_tokenizer.bos_token_id,
    max_length=50,
)
print(decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True))
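
For more than one input at a time, the same calls can be wrapped in a small helper. This is a sketch, not part of the repository: the name generate_batch and its parameters are hypothetical, and it assumes the encoder tokenizer supports padding and that passing the decoder tokenizer's pad_token_id to generate suits this model. The tokenizers and model are passed in as arguments, so the function does not depend on how they were loaded.

```python
from typing import List


def generate_batch(texts: List[str], model, encoder_tokenizer, decoder_tokenizer,
                   max_length: int = 50) -> List[str]:
    """Encode a batch of source strings, generate, and decode every output."""
    # padding=True pads shorter inputs so the batch forms one rectangular tensor
    inputs = encoder_tokenizer(texts, return_tensors="pt", padding=True)
    output_ids = model.generate(
        **inputs,
        decoder_start_token_id=decoder_tokenizer.bos_token_id,
        pad_token_id=decoder_tokenizer.pad_token_id,
        max_length=max_length,
    )
    # batch_decode turns each row of generated ids back into a string
    return decoder_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the objects loaded as shown earlier, a call would look like generate_batch(["How you dey?", "Wetin dey happen?"], model, encoder_tokenizer, decoder_tokenizer).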

Limitations

This repository provides only the tokenizer for the decoder half of pidgin14; it is not a usable standalone model and contains no weight file or config.json of its own.
It must be paired with Ephraimmm/pidgin14-encoder and the weights in Ephraimmm/pidgin14 to perform any task.
The tokenizer vocabulary is the stock GPT-2 (English-oriented) byte-level BPE vocabulary and was not retrained on Pidgin-specific text, which may reduce tokenization efficiency for Pidgin-specific spellings and slang.
Nigerian Pidgin English is a low-resource language with substantial dialectal and orthographic variation; outputs should be reviewed for fluency and correctness before use.
No evaluation metrics, benchmark results, or training-dataset documentation are published for this model. Outputs should be independently validated before any production use.
License terms are not specified in the repository; users should contact the author before commercial reuse.
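
One way to gauge the tokenization-efficiency concern noted above is to measure tokens per whitespace-separated word on sample text. The sketch below is illustrative only: tokens_per_word is a hypothetical helper, the tokenizer is passed in as any callable that returns a list of tokens (for example decoder_tokenizer.tokenize), and no measured values for Pidgin text are implied.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words found in the sample texts")
    return n_tokens / n_words
```

Comparing the ratio on Pidgin sentences against English sentences of similar length gives a rough indication of how much longer Pidgin sequences become under the stock vocabulary.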

Author

Developed by Ephraimmm
