Pidgin14 Decoder (GPT-2-medium-based)
Overview
This repository hosts the decoder-side tokenizer for pidgin14, an encoder-decoder sequence-to-sequence system for Nigerian Pidgin English ("Naija") built by Ephraim at Analytics Intelligence.
pidgin14 is composed of two halves published as separate repositories:
- Encoder: Ephraimmm/pidgin14-encoder, based on AfriBERTa, reads source text and produces contextual representations.
- Decoder (this repo): based on GPT-2-medium, consumes the encoder's representations via cross-attention and generates the output text.
The two halves are combined and trained together as a single EncoderDecoderModel, whose full weights are published at Ephraimmm/pidgin14. The architecture facts below are taken directly from that combined model's config.json (decoder sub-config), since this component repository itself contains only tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.json, merges.txt) and not a standalone config.json or weight file.
Architecture Details
From the decoder sub-configuration of the combined Ephraimmm/pidgin14 model:
| Field | Value |
|---|---|
| Base model | gpt2-medium |
| Model type | gpt2 (architecture class GPT2LMHeadModel), configured with add_cross_attention: true so it can act as the decoder half of an EncoderDecoderModel |
| Layers (n_layer) | 24 |
| Hidden size (n_embd) | 1024 |
| Attention heads (n_head) | 16 |
| Context length (n_positions / n_ctx) | 1024 |
| Vocabulary size | 50,257 |
| Activation function | gelu_new |
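As a rough sanity check on the table above, the parameter count implied by these values can be worked out by hand. The sketch below is an estimate, not a measurement of the published weights. It assumes the GPT-2 defaults that the table does not list (an MLP width of 4 × n_embd and an output head tied to the token embeddings), and it assumes each cross-attention sub-layer consists of a query projection, a fused key/value projection, an output projection, and one extra layer norm.

```python
# Values taken from the architecture table above.
N_LAYER, N_EMBD, N_POSITIONS, VOCAB = 24, 1024, 1024, 50_257
N_INNER = 4 * N_EMBD  # assumption: GPT-2 default MLP width (not listed in the table)


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of one dense projection."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale plus shift vectors of one layer norm."""
    return 2 * n


def base_block() -> int:
    """One standard GPT-2 block: self-attention, MLP, two layer norms."""
    attention = linear(N_EMBD, 3 * N_EMBD) + linear(N_EMBD, N_EMBD)
    mlp = linear(N_EMBD, N_INNER) + linear(N_INNER, N_EMBD)
    return attention + mlp + 2 * layer_norm(N_EMBD)


def cross_attention_block() -> int:
    """Assumed extra parameters per block when add_cross_attention is true."""
    query = linear(N_EMBD, N_EMBD)
    key_value = linear(N_EMBD, 2 * N_EMBD)
    output = linear(N_EMBD, N_EMBD)
    return query + key_value + output + layer_norm(N_EMBD)


# Token and position embeddings; the LM head is assumed tied to the token embeddings.
embeddings = VOCAB * N_EMBD + N_POSITIONS * N_EMBD
base_total = embeddings + N_LAYER * base_block() + layer_norm(N_EMBD)
cross_total = N_LAYER * cross_attention_block()
decoder_total = base_total + cross_total

print(f"plain gpt2-medium:      {base_total:,}")
print(f"cross-attention extras: {cross_total:,}")
print(f"decoder estimate:       {decoder_total:,}")
print(f"float32 size estimate:  {decoder_total * 4 / 1e9:.2f} GB")
```

Under these assumptions the plain gpt2-medium figure comes to 354,823,168 parameters, and the cross-attention layers add roughly 100 million more. Exact totals should be confirmed against the weights in Ephraimmm/pidgin14.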
Tokenizer shipped in this repository:
- Tokenizer class: GPT2Tokenizer (byte-level BPE)
- Vocabulary size: 50,257 tokens (vocab.json with 50,000 merge rules in merges.txt); this matches the standard, unmodified GPT-2 tokenizer vocabulary rather than a Pidgin-specific retrained vocabulary.
- Special token: <|endoftext|> used as bos/eos/pad/unk (token id 50256).
- decoder_start_token_id: 50256 (per the combined model's config).
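The vocabulary figures above fit together arithmetically. The short sketch below shows how a byte-level BPE vocabulary of this size is composed; the count of 256 base byte tokens is a property of the standard GPT-2 tokenizer rather than something stated in this repository. The `utf8_byte_ids` helper is hypothetical and only illustrates why any input string can be represented at the byte level.

```python
BYTE_TOKENS = 256      # one base token per possible byte value (standard GPT-2 byte-level BPE)
MERGE_RULES = 50_000   # learned merges, as listed for merges.txt above
SPECIAL_TOKENS = 1     # <|endoftext|>

vocab_size = BYTE_TOKENS + MERGE_RULES + SPECIAL_TOKENS
endoftext_id = vocab_size - 1  # the special token occupies the last id

print(vocab_size)    # 50257
print(endoftext_id)  # 50256


def utf8_byte_ids(text: str) -> list[int]:
    """Hypothetical helper: the raw UTF-8 bytes that byte-level BPE starts from."""
    return list(text.encode("utf-8"))


# Every byte value falls in 0..255, so Pidgin spellings and diacritics
# always decompose into known base units before merges are applied.
sample = utf8_byte_ids("Wetin dey happen?")
print(all(0 <= b < BYTE_TOKENS for b in sample))  # True
```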
Training Details
- Fine-tuned from: gpt2-medium, used as the decoder half of the pidgin14 EncoderDecoderModel (with cross-attention layers added to attend to the encoder's outputs).
- Framework: Hugging Face transformers (the combined model's config records transformers_version: 4.44.2).
- Stored precision: float32 (per the combined model's config).
- No trainer_state.json, training-step/epoch counts, optimizer settings, or training-dataset identifiers are published in this repository or in the combined Ephraimmm/pidgin14 repository. These details are therefore omitted rather than estimated.
Intended Use
- Generating Nigerian Pidgin English and/or English text as the second stage of the pidgin14 sequence-to-sequence pipeline (e.g. translation, paraphrasing, conversational response generation).
- Research and experimentation on low-resource West African language NLP.
- Must be paired with the pidgin14-encoder tokenizer and the trained weights in Ephraimmm/pidgin14 to produce output.
How to Use
from transformers import AutoTokenizer, EncoderDecoderModel
# Tokenizers for each half of the system
encoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-encoder")
decoder_tokenizer = AutoTokenizer.from_pretrained("Ephraimmm/pidgin14-decoder")
# The trained combined encoder-decoder weights
model = EncoderDecoderModel.from_pretrained("Ephraimmm/pidgin14")
text = "How you dey?"
inputs = encoder_tokenizer(text, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    decoder_start_token_id=decoder_tokenizer.bos_token_id,
    max_length=50,
)
print(decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True))
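The single-sentence example above extends naturally to several inputs at once. The sketch below wraps the same calls in a hypothetical helper, `generate_batch`, which is not part of this repository or of transformers. It assumes the encoder tokenizer defines a padding token (needed for `padding=True`) and uses `max_new_tokens` in place of `max_length` to bound only the generated portion.

```python
def generate_batch(model, encoder_tokenizer, decoder_tokenizer, texts, max_new_tokens=50):
    """Hypothetical helper: run the combined encoder-decoder model on a list of strings.

    `model`, `encoder_tokenizer`, and `decoder_tokenizer` are the objects loaded
    in the example above; they are passed in so the helper holds no global state.
    """
    # Pad to the longest input and truncate to the tokenizer's maximum length.
    inputs = encoder_tokenizer(
        list(texts),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    output_ids = model.generate(
        **inputs,
        decoder_start_token_id=decoder_tokenizer.bos_token_id,
        max_new_tokens=max_new_tokens,
    )
    # One decoded string per input, with <|endoftext|> stripped.
    return decoder_tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Usage, with the objects from the example above:
#   outputs = generate_batch(model, encoder_tokenizer, decoder_tokenizer,
#                            ["How you dey?", "Wetin dey happen?"])
#   for line in outputs:
#       print(line)
```

Passing the loaded objects as arguments keeps the helper easy to exercise with stand-ins, which is useful given that no reference outputs are published for this model.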
Limitations
- This repository provides only the tokenizer for the decoder half of pidgin14; it is not a usable standalone model and contains no weight file or config.json of its own.
- Must be paired with Ephraimmm/pidgin14-encoder and the weights in Ephraimmm/pidgin14 to perform any task.
Ephraimmm/pidgin14-encoderand the weights inEphraimmm/pidgin14to perform any task. - The tokenizer vocabulary is the stock GPT-2 (English-oriented) byte-level BPE vocabulary and was not retrained on Pidgin-specific text, which may reduce tokenization efficiency for Pidgin-specific spellings and slang.
- Nigerian Pidgin English is a low-resource language with substantial dialectal and orthographic variation; outputs should be reviewed for fluency and correctness before use.
- No evaluation metrics, benchmark results, or training-dataset documentation are published for this model. Outputs should be independently validated before any production use.
- License terms are not specified in the repository; users should contact the author before commercial reuse.
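The tokenization-efficiency limitation noted above can be quantified with a simple tokens-per-word ratio (often called fertility). The sketch below is one way to measure it; `tokens_per_word` is a hypothetical helper, and no measured values are published for this tokenizer, so the function is shown without results. Higher ratios mean words are split into more pieces.

```python
from typing import Callable, Iterable, Sequence


def tokens_per_word(tokenize: Callable[[str], Sequence], sentences: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-separated word.

    `tokenize` is any callable mapping a string to a sequence of tokens,
    for example the `tokenize` method of a loaded Hugging Face tokenizer.
    """
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words


# Usage with the decoder tokenizer loaded as in "How to Use":
#   ratio = tokens_per_word(decoder_tokenizer.tokenize, ["How you dey?", "Wetin dey happen?"])
#   print(f"{ratio:.2f} tokens per word")
```

Comparing the ratio on Pidgin text against the ratio on comparable English text gives a rough sense of how much the stock GPT-2 vocabulary fragments Pidgin spellings.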
Author
Developed by Ephraimmm