---
license: mit
datasets:
- eligapris/kirundi-english
language:
- rn
library_name: transformers
---

# eligapris/rn-tokenizer

## Model Description

This repository contains a BPE tokenizer trained specifically for the Kirundi language (ISO 639-3 code: run).

### Tokenizer Details
- **Type**: BPE (Byte-Pair Encoding)
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based
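
For illustration, a tokenizer with these properties can be reproduced with Hugging Face's `tokenizers` library. This is a minimal sketch, not the actual training script; the vocabulary size and corpus file name are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Empty BPE model with an unknown-token fallback, as described above
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Whitespace-based pre-tokenization, matching the tokenizer details
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# vocab_size and the corpus file below are illustrative placeholders
trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["kirundi_corpus.txt"], trainer=trainer)
tokenizer.save("kirundi_tokenizer.json")
```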

## Intended Uses & Limitations

### Intended Uses
- Text processing for the Kirundi language
- Pre-processing for NLP tasks involving Kirundi
- Foundation for developing Kirundi language applications

### Limitations
- Trained on a relatively small corpus (21.4k sentence pairs), so coverage of rare words and specialized domains is limited
- Domain balance is skewed toward the religious, general, and conversational text in the training data

## Training Data

The tokenizer was trained on the Kirundi-English parallel corpus:
- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed domain including religious, general, and conversational text
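
If you want to inspect the corpus yourself, it can be loaded from the Hub with the `datasets` library. A minimal sketch; the exact splits and column names depend on the dataset's schema:

```python
from datasets import load_dataset

# Download the parallel corpus from the Hugging Face Hub
corpus = load_dataset("eligapris/kirundi-english")

# Print the available splits and a first example
print(corpus)
print(corpus["train"][0])  # assumes a "train" split exists
```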

## Installation

You can use this tokenizer in your project by first installing the required dependencies:

```bash
pip install transformers
```

Then load the tokenizer directly from the Hugging Face Hub:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")
```

Or if you have downloaded the tokenizer files locally:

```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="kirundi_tokenizer.json")
```

## Usage Examples

### Loading and Using the Tokenizer

You can load the tokenizer in two ways:

```python
# Method 1: Using AutoTokenizer (recommended)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("eligapris/rn-tokenizer")

# Method 2: Using PreTrainedTokenizerFast with a local file
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="kirundi_tokenizer.json")
```

#### Basic Usage Examples

1. Tokenize a single sentence:
```python
# Basic tokenization
text = "ab'umudugudu hafi ya bose bateranira kumva ijambo ry'Imana."
encoded = tokenizer(text)
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'])}")
```

2. Batch tokenization:
```python
# Process multiple sentences at once
texts = [
    "ifumbire mvaruganda.",
    "aba azi gukora kandi afite ubushobozi"
]
encoded = tokenizer(texts, padding=True, truncation=True)
print("Batch encoding:", encoded)
```

3. Get token IDs with special tokens:
```python
# Add special tokens like [CLS] and [SEP]
encoded = tokenizer(text, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(f"Tokens with special tokens: {tokens}")
```

4. Decode tokenized text:
```python
# Convert token IDs back to text
ids = encoded['input_ids']
decoded_text = tokenizer.decode(ids)
print(f"Decoded text: {decoded_text}")
```

5. Padding and truncation:
```python
# Pad or truncate sequences to a fixed length
encoded = tokenizer(
    texts,
    padding='max_length',
    max_length=32,
    truncation=True,
    return_tensors='pt'  # Return PyTorch tensors
)
print("Padded sequences:", encoded['input_ids'].shape)
```

## Future Development

This tokenizer is intended to serve as a foundation for future Kirundi language model development, including potential fine-tuning with techniques like LoRA (Low-Rank Adaptation).
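
As an illustration of what such a setup might look like, the sketch below configures a LoRA adapter with the PEFT library. The base model name is a placeholder and the hyperparameters are illustrative, not a tested recipe:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder: substitute a real multilingual base checkpoint
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")

# Illustrative LoRA hyperparameters
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention query/value projections; names vary by architecture
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```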

## Technical Specifications

### Software Requirements
```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0"
}
```
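
To confirm that your environment satisfies these requirements, you can print the installed versions; a small sketch using only the standard library:

```python
import importlib.metadata

# Print the installed version of each required package
for package in ("transformers", "tokenizers"):
    print(package, importlib.metadata.version(package))
```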

## Contact

eligapris

---

## Updates and Versions

- v1.0.0 (Initial Release)
  - Base tokenizer implementation
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation

## Acknowledgments

- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries