# Yoruba Constituency Parser (Fine-tuned T5 Model)

## Overview
This repository hosts a transformer-based constituency parser fine-tuned on the manually annotated Yoruba Constituency Treebank (Version 1.0). The model is designed to automatically generate phrase-structure trees for Yoruba sentences, supporting both linguistic research and NLP applications.
The parser is built on a T5 architecture and was fine-tuned to understand Yoruba syntax, including:
- Serial Verb Constructions (SVCs)
- Focus constructions
- Embedded complement clauses
- Relative clauses
- Clause chaining
This model is intended for academic use, syntactic analysis, and computational research in Yoruba language processing.
## Model Files
| File | Description |
|---|---|
| `config.json` | Model architecture and configuration settings. |
| `pytorch_model.bin` or `model.safetensors` | Trained model weights. |
| `tokenizer.json` or `tokenizer.model` | Tokenization rules for Yoruba sentences. |
| `tokenizer_config.json` | Tokenizer settings and special rules. |
| `special_tokens_map.json` | Maps special tokens (e.g., `<pad>`, `<eos>`). |
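Before loading, it can be useful to verify that a locally downloaded snapshot actually contains the files listed above. This is an illustrative helper, not part of the repository; the `REQUIRED_FILES` groups simply mirror the table, with alternatives grouped because weights and tokenizer files ship in one of two formats:

```python
from pathlib import Path

# Each tuple lists acceptable variants for one required file (from the table above).
REQUIRED_FILES = [
    ("config.json",),
    ("pytorch_model.bin", "model.safetensors"),
    ("tokenizer.json", "tokenizer.model"),
    ("tokenizer_config.json",),
    ("special_tokens_map.json",),
]

def missing_files(model_dir):
    """Return the file groups with no matching file in model_dir."""
    model_dir = Path(model_dir)
    return [
        group
        for group in REQUIRED_FILES
        if not any((model_dir / name).is_file() for name in group)
    ]
```

If `missing_files(...)` returns a non-empty list, the snapshot is incomplete and should be re-downloaded.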
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned Yoruba parser
model_name_or_path = "YOUR-HF-USERNAME/yoruba-constituency-parser"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

# Parse a sample Yoruba sentence
sentence = "Mo ra aso tuntun"
inputs = tokenizer(sentence, return_tensors="pt")
# Allow enough room for the full bracketed tree; the default generation
# length would truncate longer parses.
outputs = model.generate(**inputs, max_new_tokens=256)
parsed_tree = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(parsed_tree)
```
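Because a seq2seq model does not guarantee well-formed output, it is worth sanity-checking generated parses before downstream use. A minimal sketch, assuming the parser emits Penn-style bracketed trees (the exact bracket format and label set are assumptions; consult the treebank's annotation guidelines):

```python
def is_balanced(tree_str):
    """Check that every '(' in a bracketed parse string has a matching ')'."""
    depth = 0
    for ch in tree_str:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0
```

For example, `is_balanced("(S (NP Mo) (VP ra (NP aso tuntun)))")` returns `True`, while a truncated generation such as `"(S (NP Mo) (VP ra"` returns `False` and should be re-generated or discarded.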