# Yoruba Constituency Parser (Fine-tuned T5 Model)

## Overview
This repository hosts a transformer-based constituency parser fine-tuned on the manually annotated Yoruba Constituency Treebank (Version 1.0). The model is designed to automatically generate phrase-structure trees for Yoruba sentences, supporting both linguistic research and NLP applications.
The parser is built on a T5 architecture and was fine-tuned to understand Yoruba syntax, including:
- Serial Verb Constructions (SVCs)
- Focus constructions
- Embedded complement clauses
- Relative clauses
- Clause chaining
This model is intended for academic use, syntactic analysis, and computational research in Yoruba language processing.
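As a sketch of what constituency output typically looks like, the snippet below parses a bracketed phrase-structure string into nested lists using only the standard library. The bracketed format, the label set (`S`, `NP`, `VP`, etc.), and the example tree for *Mo ra aso tuntun* ("I bought a new garment") are illustrative assumptions; the actual output format and labels are defined by the Yoruba Constituency Treebank annotation scheme.

```python
def parse_brackets(s):
    """Parse a bracketed tree string, e.g. "(S (NP Mo) ...)",
    into nested [label, child, ...] lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def helper(i):
        # Each subtree starts with "(" followed by its label.
        assert tokens[i] == "("
        node = [tokens[i + 1]]
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)   # recurse into a subtree
                node.append(child)
            else:
                node.append(tokens[i])  # leaf word
                i += 1
        return node, i + 1

    tree, _ = helper(0)
    return tree

# Hypothetical parse for "Mo ra aso tuntun" (labels are illustrative):
tree = parse_brackets("(S (NP Mo) (VP (V ra) (NP (N aso) (ADJ tuntun))))")
print(tree[0])  # S
```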
## Model Files
| File | Description |
|---|---|
| `config.json` | Model architecture and configuration settings. |
| `pytorch_model.bin` or `model.safetensors` | Trained model weights. |
| `tokenizer.json` or `tokenizer.model` | Tokenization rules for Yoruba sentences. |
| `tokenizer_config.json` | Tokenizer settings and special rules. |
| `special_tokens_map.json` | Maps special tokens (e.g., `<pad>`, `<eos>`). |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned Yoruba parser
model_name_or_path = "YOUR-HF-USERNAME/yoruba-constituency-parser"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

# Parse a sample Yoruba sentence ("I bought a new garment")
sentence = "Mo ra aso tuntun"
inputs = tokenizer(sentence, return_tensors="pt")

# Allow enough tokens for the full bracketed tree; the default
# generation length may truncate longer parses.
outputs = model.generate(**inputs, max_new_tokens=256)

parsed_tree = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(parsed_tree)
```
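Because a seq2seq model generates the tree as free text, the output is not guaranteed to be a well-formed bracketing. A minimal post-processing check, sketched below with only the standard library, verifies that the decoded string has balanced parentheses before downstream use; the helper name `is_well_formed` is an assumption, not part of this repository.

```python
def is_well_formed(tree_str):
    """Return True if the string starts with "(" and every opening
    parenthesis has a matching close (a minimal sanity check)."""
    depth = 0
    for ch in tree_str:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ")" appeared before its "("
                return False
    return depth == 0 and tree_str.strip().startswith("(")

print(is_well_formed("(S (NP Mo) (VP ra))"))  # True
print(is_well_formed("(S (NP Mo) (VP ra)"))   # False (unclosed bracket)
```

If the check fails, a simple fallback is to re-generate with different decoding settings (e.g., beam search) or discard the parse.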
## Authors
**Victoria A. Akindele**
Department of Linguistics, University of Ibadan

**Dr. Gerald Nweya** (Project Supervisor)
Department of Linguistics, University of Ibadan