# Kirundi Tokenizer and LoRA Model

## Model Description

This repository contains two main components:

1. A BPE tokenizer trained specifically for the Kirundi language (ISO 639-3 code: `run`)
2. A LoRA adapter trained for Kirundi language processing

### Tokenizer Details

- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Pre-tokenization**: Whitespace-based

### LoRA Adapter Details

- **Base Model**: [To be filled with your chosen base model]
- **Rank**: 8
- **Alpha**: 32
- **Target Modules**: Query and value attention matrices
- **Dropout**: 0.05
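
These hyperparameters correspond to a `peft` configuration along the lines of the sketch below. The module names `q_proj` and `v_proj` are an assumption; the actual names depend on the base model's architecture.

```python
from peft import LoraConfig

# Sketch only: mirrors the hyperparameters listed above.
lora_config = LoraConfig(
    r=8,                                  # rank
    lora_alpha=32,                        # scaling; effective factor alpha/r = 4
    target_modules=["q_proj", "v_proj"],  # assumed names for the query/value projections
    lora_dropout=0.05,
)
```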

## Intended Uses & Limitations

### Intended Uses

- Text processing for the Kirundi language
- Machine translation tasks involving Kirundi
- Natural language understanding tasks for Kirundi content
- A foundation for developing Kirundi language applications

### Limitations

- The tokenizer was trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text

## Training Data

Both components were trained on the Kirundi-English parallel corpus:

- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed, including religious, general, and conversational text
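
The corpus can be inspected with the `datasets` library. This is a minimal sketch; the available splits and column names should be checked against the dataset card.

```python
from datasets import load_dataset

# Loads the parallel corpus from the Hugging Face Hub.
dataset = load_dataset("eligapris/kirundi-english")
print(dataset)  # inspect splits and column names
```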

## Training Procedure

### Tokenizer Training

- Trained with Hugging Face's Tokenizers library
- BPE algorithm with a vocabulary size of 30,000
- Includes special tokens for task-specific usage
- Trained on the Kirundi side of the parallel corpus
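
A sketch of how a tokenizer with this configuration can be trained using the `tokenizers` API. The toy sentence list is a placeholder for an iterator over the Kirundi side of the corpus.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# BPE model with whitespace pre-tokenization, as described above.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# Placeholder corpus; in practice, iterate over the Kirundi sentences
# from the dataset above.
kirundi_sentences = ["Amahoro", "..."]
tokenizer.train_from_iterator(kirundi_sentences, trainer=trainer)
tokenizer.save("tokenizer.json")
```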

### LoRA Training

[To be filled with your specific training details]

- Number of epochs:
- Batch size:
- Learning rate:
- Training hardware:
- Training time:

## Evaluation Results

[To be filled with your evaluation metrics]

- Coverage statistics:
- Out-of-vocabulary rate:
- Task-specific metrics:

## Environmental Impact

[To be filled with training compute details]

- Estimated CO2 emissions:
- Hardware used:
- Training duration:

## Technical Specifications

### Model Architecture

- Tokenizer: BPE-based with custom vocabulary
- LoRA configuration:
  - r=8 (rank)
  - α=32 (scaling; effective factor α/r = 4)
  - Applied to the query and value attention matrices
  - Dropout rate: 0.05

### Software Requirements

```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0",
    "peft": ">=0.4.0",
}
```

## How to Use

### Loading the Tokenizer

```python
from transformers import PreTrainedTokenizerFast

# Load the trained tokenizer from this repository (or a local path).
tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
```
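
Once loaded, the tokenizer behaves like any other fast tokenizer. A small usage sketch follows; the input string is an arbitrary Kirundi greeting used purely for illustration.

```python
# Encode a sentence and inspect the resulting BPE tokens.
encoded = tokenizer("Amahoro")
print(encoded.tokens())
print(encoded["input_ids"])
```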

### Loading the LoRA Model

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

# Read the adapter config to find the base model it was trained against.
config = PeftConfig.from_pretrained("path_to_lora_model")

# Load the base model, then attach the LoRA adapter weights.
model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, "path_to_lora_model")
```
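
A hypothetical end-to-end sketch that reuses the `tokenizer` and `model` variables from the snippets above; it assumes the sequence-classification setup shown there, and the number and meaning of the classes depend on how the adapter was actually trained.

```python
import torch

# Tokenize an input sentence and run a forward pass.
inputs = tokenizer("Amahoro", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class.
predicted_class = outputs.logits.argmax(dim=-1).item()
print(predicted_class)
```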

## Citation

[To be filled with your preferred citation format]

## License

[Specify your chosen license]

## Contact

[Your contact information or preferred method of contact]

---

## Updates and Versions

- v1.0.0 (Initial Release)
  - Base tokenizer and LoRA model
  - Trained on the Kirundi-English parallel corpus
  - Basic functionality and documentation

## Acknowledgments

- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries
- PEFT library for the LoRA implementation