---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-135M
tags:
- smollm2
- bangla
- bengali
- multilingual
- causal-lm
- text-generation
---
# Modified SmolLM2 with Bangla Tokenizer Support
This is a modified version of SmolLM2-135M with enhanced Bangla (Bengali) tokenizer support, created by merging tokens from the TituLM tokenizer into the SmolLM2 vocabulary.
## Model Details
- **Base Model**: HuggingFaceTB/SmolLM2-135M
- **Tokenizer Enhancement**: Merged with TituLM Bangla tokenizer
- **Original Vocabulary Size**: 49,152
- **Enhanced Vocabulary Size**: 180,177
- **Added Tokens**: 131,025 Bangla-specific tokens
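The counts above are internally consistent; a quick sanity check:

```python
# Vocabulary sizes from the model card
original_vocab = 49_152
enhanced_vocab = 180_177

# Number of tokens added by the TituLM merge
print(enhanced_vocab - original_vocab)  # 131025
```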
## Key Features
- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with original SmolLM2
- ✅ More efficient tokenization of Bangla text (fewer tokens per sentence)
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the modified model
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")
# Test with Bangla text
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
## Training
This model was created by:
1. Merging the TituLM Bangla tokenizer's vocabulary into the SmolLM2 tokenizer
2. Resizing the model's token embeddings to accommodate the new vocabulary
3. Preserving the original model weights and architecture
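In the Transformers library, step 1 corresponds to `tokenizer.add_tokens(...)` and step 2 to `model.resize_token_embeddings(len(tokenizer))`, which grows the embedding matrix from 49,152 to 180,177 rows while leaving existing rows untouched. The vocabulary bookkeeping behind step 1 can be sketched in plain Python with toy vocabularies (not the real tokenizers):

```python
def merge_vocabs(base_vocab, extra_tokens):
    """Append tokens from extra_tokens that base_vocab lacks,
    assigning each the next free id (mirrors what add_tokens does)."""
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1 if merged else 0
    added = 0
    for tok in extra_tokens:
        if tok not in merged:
            merged[tok] = next_id
            next_id += 1
            added += 1
    return merged, added

# Toy example: 5 base tokens, 3 candidate Bangla tokens, 1 overlap
base = {"<s>": 0, "</s>": 1, "the": 2, "an": 3, "গান": 4}
extra = ["আমি", "বাংলায়", "গান"]  # "গান" is already present
merged, added = merge_vocabs(base, extra)
print(len(merged), added)  # 7 2
```

Only tokens absent from the base vocabulary get new ids, which is why the added-token count (131,025) is exactly the difference between the two vocabulary sizes.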
## Citation
If you use this model, please cite both the original SmolLM2 and TituLM:
```bibtex
@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
```
## License
This model is released under the Apache 2.0 License, same as the base SmolLM2 model.