
Pidgin14 Decoder (GPT-2-medium-based)

Overview

This repository hosts the decoder-side tokenizer for pidgin14, an encoder-decoder sequence-to-sequence system for Nigerian Pidgin English ("Naija") built by Ephraim at Analytics Intelligence.

pidgin14 is composed of two halves published as separate repositories:

  • Encoder: Ephraimmm/pidgin14-encoder, based on AfriBERTa, reads source text and produces contextual representations.
  • Decoder (this repo): based on GPT-2-medium, consumes the encoder's representations via cross-attention and generates the output text.

The two halves are combined and trained together as a single EncoderDecoderModel, whose full weights are published at Ephraimmm/pidgin14. The architecture facts below are taken directly from that combined model's config.json (decoder sub-config), since this component repository itself contains only tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.json, merges.txt) and not a standalone config.json or weight file.

Architecture Details

From the decoder sub-configuration of the combined Ephraimmm/pidgin14 model:

Field | Value
Base model | gpt2-medium
Model type | gpt2 (architecture class GPT2LMHeadModel), configured with add_cross_attention: true so it can act as the decoder half of an EncoderDecoderModel
Layers (n_layer) | 24
Hidden size (n_embd) | 1024
Attention heads (n_head) | 16
Context length (n_positions / n_ctx) | 1024
Vocabulary size | 50,257
Activation function | gelu_new
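
As a rough cross-check of the table, the values above can be plugged into the usual GPT-2 parameter-count arithmetic. This is a back-of-the-envelope sketch, not a count read from the published weights. The function name `gpt2_param_count` is hypothetical, and the cross-attention term assumes the layout used by the transformers GPT-2 implementation (a query projection, a fused key/value projection, an output projection, and one extra layer norm per block). It also assumes tied input/output embeddings and ignores any projection that EncoderDecoderModel may add between encoder and decoder hidden sizes.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, n_positions, cross_attention=False):
    """Estimate trainable parameters for a GPT-2 style decoder (tied LM head)."""
    d = n_embd
    # Token embeddings (wte) plus learned position embeddings (wpe).
    embeddings = vocab_size * d + n_positions * d
    # Per block: self-attention (4d^2 + 4d), MLP (8d^2 + 5d), two layer norms (4d).
    per_block = 12 * d * d + 13 * d
    if cross_attention:
        # Assumed layout: q (d^2 + d), fused k/v (2d^2 + 2d), out (d^2 + d), layer norm (2d).
        per_block += 4 * d * d + 6 * d
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_block + final_layer_norm


base = gpt2_param_count(24, 1024, 50257, 1024)
with_cross = gpt2_param_count(24, 1024, 50257, 1024, cross_attention=True)
print(f"{base:,}")        # 354,823,168
print(f"{with_cross:,}")  # 455,633,920
```

The first figure matches the commonly quoted size of gpt2-medium; the second is only an estimate for the decoder with cross-attention enabled.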

Tokenizer shipped in this repository:

  • Tokenizer class: GPT2Tokenizer (byte-level BPE)
  • Vocabulary size: 50,257 tokens (vocab.json, with 50,000 merge rules in merges.txt). This matches the standard, unmodified GPT-2 tokenizer vocabulary rather than a Pidgin-specific retrained vocabulary.
  • Special token: <|endoftext|> used as bos/eos/pad/unk (token id 50256).
  • decoder_start_token_id: 50256 (per the combined model's config).
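
The tokenizer figures above fit together arithmetically. The short check below shows how the standard GPT-2 vocabulary size follows from its parts: 256 base byte tokens, one new token per merge rule, and a single special token appended last. The constant names are chosen here for illustration.

```python
N_BYTE_TOKENS = 256   # one token per possible byte value
N_MERGES = 50_000     # each merge rule in merges.txt adds one token
N_SPECIAL = 1         # <|endoftext|>

vocab_size = N_BYTE_TOKENS + N_MERGES + N_SPECIAL
endoftext_id = vocab_size - 1  # the special token is appended last

print(vocab_size, endoftext_id)  # 50257 50256
```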

Training Details

  • Fine-tuned from: gpt2-medium, used as the decoder half of the pidgin14 EncoderDecoderModel (with cross-attention layers added to attend to the encoder's outputs).
  • Framework: Hugging Face transformers (the combined model's config records transformers_version: 4.44.2).
  • Stored precision: float32 (per the combined model's config).
  • No trainer_state.json, training-step/epoch counts, optimizer settings, or training-dataset identifiers are published in this repository or in the combined Ephraimmm/pidgin14 repository. These details are therefore omitted rather than estimated.

Intended Use

  • Generating Nigerian Pidgin English and/or English text as the second stage of the pidgin14 sequence-to-sequence pipeline (e.g. translation, paraphrasing, conversational response generation).
  • Research and experimentation on low-resource West African language NLP.
  • This tokenizer must be paired with the pidgin14-encoder tokenizer and the trained weights in Ephraimmm/pidgin14 to produce output.

How to Use

from transformers import AutoTokenizer, EncoderDecoderModel

# Tokenizers for each half of the system
encoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-encoder")
decoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-decoder")

# The trained combined encoder-decoder weights
model = EncoderDecoderModel.from_pretrained("Ephraimmm/pidgin14")

text = "How you dey?"
inputs = encoder_tokenizer(text, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    decoder_start_token_id=decoder_tokenizer.bos_token_id,
    max_length=50,
)
print(decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True))
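
Because <|endoftext|> (id 50256) serves as both the decoder start token and the end-of-sequence token, generated sequences begin and end with the same id, and skip_special_tokens=True removes both when decoding. The toy loop below sketches that behaviour in plain Python. It is not the transformers generate() implementation: `greedy_decode`, `strip_special`, and `fake_next_token` are hypothetical stand-ins, and the token ids 10, 20, 30 are made up.

```python
from typing import Callable, List

EOT_ID = 50256  # <|endoftext|>: start and end token, per this card


def greedy_decode(next_token: Callable[[List[int]], int],
                  start_id: int = EOT_ID,
                  eos_id: int = EOT_ID,
                  max_length: int = 50) -> List[int]:
    """Start from start_id and append tokens until eos_id or max_length."""
    ids = [start_id]
    while len(ids) < max_length:
        token = next_token(ids)
        ids.append(token)
        if token == eos_id:
            break
    return ids


def strip_special(ids: List[int], special_ids=frozenset({EOT_ID})) -> List[int]:
    """Drop special tokens, similar in spirit to skip_special_tokens=True."""
    return [t for t in ids if t not in special_ids]


def fake_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for the model: emits a fixed script, then EOS."""
    script = [10, 20, 30, EOT_ID]
    return script[len(prefix) - 1]


ids = greedy_decode(fake_next_token)
print(ids)                 # [50256, 10, 20, 30, 50256]
print(strip_special(ids))  # [10, 20, 30]
```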

Limitations

  • This repository provides only the tokenizer for the decoder half of pidgin14; it is not a usable standalone model and contains no weight file or config.json of its own.
  • It must be paired with Ephraimmm/pidgin14-encoder and the weights in Ephraimmm/pidgin14 to perform any task.
  • The tokenizer vocabulary is the stock GPT-2 (English-oriented) byte-level BPE vocabulary and was not retrained on Pidgin-specific text, which may reduce tokenization efficiency for Pidgin-specific spellings and slang.
  • Nigerian Pidgin English is a low-resource language with substantial dialectal and orthographic variation; outputs should be reviewed for fluency and correctness before use.
  • No evaluation metrics, benchmark results, or training-dataset documentation are published for this model. Outputs should be independently validated before any production use.
  • License terms are not specified in the repository; users should contact the author before commercial reuse.
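
The tokenization-efficiency limitation listed above can be illustrated with a toy BPE merge loop. This is a simplified sketch: it works on characters rather than bytes, and the merge table `TOY_MERGES` is hypothetical, chosen to mimic a vocabulary learned mostly from English text. It does not reproduce the real GPT-2 merges or the actual token counts for any Pidgin word.

```python
def apply_merges(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge that is present."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no learned merge applies to any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, as if learned from English-heavy text.
TOY_MERGES = [("t", "h"), ("th", "e"), ("a", "r"), ("ar", "e")]

print(apply_merges("the", TOY_MERGES))  # ['the']
print(apply_merges("dey", TOY_MERGES))  # ['d', 'e', 'y']
```

A word covered by the learned merges collapses into one token, while a spelling the merges never saw stays split into smaller pieces, which lengthens sequences.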

Author

Developed by Ephraimmm
