---
license: mit
datasets:
- eligapris/kirundi-english
language:
- rn
library_name: transformers
tags:
- kirundi
- rn
---
# Kirundi Tokenizer and LoRA Model
## Model Description
This repository contains two main components:
1. A BPE tokenizer trained specifically for the Kirundi language (ISO code: run)
2. A LoRA adapter trained for Kirundi language processing
### Tokenizer Details
- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based
### LoRA Adapter Details
- **Base Model**: [To be filled with your chosen base model]
- **Rank**: 8
- **Alpha**: 32
- **Target Modules**: Query and Value attention matrices
- **Dropout**: 0.05
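Expressed as a PEFT configuration, the adapter settings above correspond roughly to the sketch below; the `target_modules` names are an assumption and must match the attention projection names of whichever base model is used:

```python
from peft import LoraConfig, TaskType

# Sketch of the adapter hyperparameters listed above.
# target_modules names ("q_proj"/"v_proj") are illustrative and depend on the base model.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,           # assumption: a classification-style head
    r=8,                                   # LoRA rank
    lora_alpha=32,                         # scaling factor (alpha)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # query and value attention matrices
)
```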
## Intended Uses & Limitations
### Intended Uses
- Text processing for Kirundi language
- Machine translation tasks involving Kirundi
- Natural language understanding tasks for Kirundi content
- Foundation for developing Kirundi language applications
### Limitations
- The tokenizer was trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text
## Training Data
The model components were trained on the Kirundi-English parallel corpus:
- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed domain including religious, general, and conversational text
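The corpus can be pulled directly from the Hub with the `datasets` library; the split name and record layout shown below are assumptions for illustration:

```python
from datasets import load_dataset

# Load the Kirundi-English parallel corpus
dataset = load_dataset("eligapris/kirundi-english")

print(dataset)              # inspect the available splits and columns
print(dataset["train"][0])  # assumption: a "train" split with sentence pairs
```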
## Training Procedure
### Tokenizer Training
- Trained using Hugging Face's Tokenizers library
- BPE algorithm with a vocabulary size of 30k
- Includes special tokens for task-specific usage
- Trained on the Kirundi portion of the parallel corpus
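The procedure above can be reproduced roughly as follows with the Tokenizers library; the corpus file name and output path are placeholders, not the exact script used for this release:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE model with whitespace pre-tokenization, matching the tokenizer described above
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=30_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# Placeholder path: the Kirundi side of the parallel corpus exported as plain text
tokenizer.train(files=["kirundi_corpus.txt"], trainer=trainer)
tokenizer.save("kirundi-tokenizer.json")
```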
### LoRA Training
[To be filled with your specific training details]
- Number of epochs:
- Batch size:
- Learning rate:
- Training hardware:
- Training time:
## Evaluation Results
[To be filled with your evaluation metrics]
- Coverage statistics:
- Out-of-vocabulary rate:
- Task-specific metrics:
## Environmental Impact
[To be filled with training compute details]
- Estimated CO2 emissions:
- Hardware used:
- Training duration:
## Technical Specifications
### Model Architecture
- Tokenizer: BPE-based with custom vocabulary
- LoRA Configuration:
  - r=8 (rank)
  - α=32 (scaling factor)
  - Applied to the query and value attention matrices
  - Dropout rate: 0.05
### Software Requirements
```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0",
    "peft": ">=0.4.0",
}
```
## How to Use
### Loading the Tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from a local directory or a Hugging Face Hub repo ID
tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
```
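Once loaded, it behaves like any other fast tokenizer; the sample sentence below is arbitrary:

```python
# Encode a Kirundi sentence and inspect the resulting subword tokens
encoding = tokenizer("Amahoro")
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```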
### Loading the LoRA Model
```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

# Read the adapter config to find its base model, load that model,
# then attach the LoRA adapter weights on top of it
config = PeftConfig.from_pretrained("path_to_lora_model")
model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, "path_to_lora_model")
```
## Contact
Eligapris
---
## Updates and Versions
- v1.0.0 (Initial Release)
  - Base tokenizer and LoRA model
  - Trained on the Kirundi-English parallel corpus
  - Basic functionality and documentation
## Acknowledgments
- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries
- PEFT library for LoRA implementation