---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-135M
tags:
- smollm2
- bangla
- bengali
- multilingual
- causal-lm
- text-generation
---

# Modified SmolLM2 with Bangla Tokenizer Support

This is a modified version of SmolLM2-135M whose tokenizer has been extended with Bangla (Bengali) tokens merged from the TituLM tokenizer.

## Model Details

- **Base Model**: HuggingFaceTB/SmolLM2-135M
- **Tokenizer Enhancement**: Merged with the TituLM Bangla tokenizer
- **Original Vocabulary Size**: 49,152
- **Enhanced Vocabulary Size**: 180,177
- **Added Tokens**: 131,025 Bangla-specific tokens (180,177 - 49,152)
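
As a quick sanity check on the figures above, the following sketch recomputes the number of added tokens and estimates how much the token embedding table grows. The hidden size of 576 is an assumption taken from the published SmolLM2-135M configuration, not from this card; adjust it if the configuration differs.

```python
# Vocabulary sizes as stated in this model card.
ORIGINAL_VOCAB = 49_152
ENHANCED_VOCAB = 180_177

# Assumption: SmolLM2-135M uses a hidden size of 576 (not stated in this card).
HIDDEN_SIZE = 576

# Net growth of the vocabulary after the merge.
added_tokens = ENHANCED_VOCAB - ORIGINAL_VOCAB

# One embedding row of HIDDEN_SIZE values per token id.
original_embedding_params = ORIGINAL_VOCAB * HIDDEN_SIZE
enhanced_embedding_params = ENHANCED_VOCAB * HIDDEN_SIZE
extra_embedding_params = enhanced_embedding_params - original_embedding_params

print(f"Added tokens:            {added_tokens:,}")
print(f"Embedding params before: {original_embedding_params:,}")
print(f"Embedding params after:  {enhanced_embedding_params:,}")
print(f"Extra embedding params:  {extra_embedding_params:,}")
```

Under that assumption, the embedding table alone grows by roughly 75 million parameters, which is worth keeping in mind when estimating memory use for a model of this size.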

## Key Features

- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with the original SmolLM2
- ✅ Improved performance on Bangla text

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the modified model and its extended tokenizer
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")

# Test with Bangla text
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")

# max_length counts the prompt tokens plus the generated tokens
outputs = model.generate(**inputs, max_length=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Training

This model was created by:

1. Merging the TituLM Bangla tokenizer with the SmolLM2 tokenizer
2. Resizing the model embeddings to accommodate the new vocabulary
3. Preserving the original model weights and architecture
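
The three steps above can be illustrated with a toy, dependency-free sketch. This is not the script used to build this model: the function names `merge_vocab` and `resize_embeddings`, the tiny vocabularies, and the choice to initialize new rows with the mean of the existing rows are all assumptions made for illustration. With `transformers`, the corresponding calls would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
def merge_vocab(base_vocab, extra_tokens):
    """Return a new token->id map: base ids unchanged, unseen tokens appended."""
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1 if merged else 0
    for token in extra_tokens:
        if token not in merged:  # skip tokens the base vocabulary already has
            merged[token] = next_id
            next_id += 1
    return merged


def resize_embeddings(embeddings, new_size):
    """Return a copy grown to new_size rows; new rows get the mean of old rows."""
    if new_size < len(embeddings):
        raise ValueError("this sketch only grows the embedding table")
    dim = len(embeddings[0])
    mean_row = [sum(row[i] for row in embeddings) / len(embeddings) for i in range(dim)]
    resized = [list(row) for row in embeddings]  # original rows are copied as-is
    resized.extend(list(mean_row) for _ in range(new_size - len(embeddings)))
    return resized


# Hypothetical miniature vocabularies standing in for SmolLM2 and TituLM.
base_vocab = {"<s>": 0, "the": 1, "song": 2}
bangla_tokens = ["আমি", "গান", "the"]  # "the" already exists and is skipped

merged_vocab = merge_vocab(base_vocab, bangla_tokens)

# Hypothetical 2-dimensional embeddings, one row per base token id.
base_embeddings = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
merged_embeddings = resize_embeddings(base_embeddings, len(merged_vocab))

print(merged_vocab)
print(merged_embeddings)
```

Because existing ids and rows are left untouched in this sketch, inputs that use only original tokens map to the same embeddings as before, which is the sense in which the steps above preserve the original weights.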

## Citation

If you use this model, please cite both the original SmolLM2 and TituLM:

```bibtex
@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
```

## License

This model is released under the Apache 2.0 License, the same license as the base SmolLM2 model.