---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-135M
tags:
- smollm2
- bangla
- bengali
- multilingual
- causal-lm
- text-generation
---

# Modified SmolLM2 with Bangla Tokenizer Support

This is a modified version of SmolLM2-135M that adds Bangla (Bengali) tokenizer support by merging tokens from the TituLM tokenizer.

## Model Details

- **Base Model**: HuggingFaceTB/SmolLM2-135M
- **Tokenizer Enhancement**: Merged with TituLM Bangla tokenizer
- **Original Vocabulary Size**: 49,152
- **Enhanced Vocabulary Size**: 180,177
- **Added Tokens**: 131,025 Bangla-specific tokens

## Key Features

- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with original SmolLM2
- ✅ More efficient tokenization of Bangla text (fewer tokens per sentence)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the modified model
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")

# Test with Bangla text
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Training

This model was created by:
1. Merging TituLM Bangla tokenizer with SmolLM2 tokenizer
2. Resizing model embeddings to accommodate new vocabulary
3. Preserving original model weights and architecture
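The three steps above can be sketched with toy vocabularies standing in for the real SmolLM2 and TituLM tokenizers. In `transformers`, steps 1–2 correspond to `tokenizer.add_tokens(...)` and step 3 to `model.resize_token_embeddings(len(tokenizer))`; the vocabularies and token strings below are illustrative only:

```python
# Toy sketch of the tokenizer-merge procedure; tiny dicts stand in
# for the real 49,152-token and Bangla vocabularies.
base_vocab = {"<s>": 0, "hello": 1, "world": 2}      # stands in for SmolLM2's vocab
bangla_vocab = {"আমি": 0, "বাংলা": 1, "hello": 2}    # stands in for TituLM's vocab

# Step 1: collect tokens the base vocabulary is missing
new_tokens = [tok for tok in bangla_vocab if tok not in base_vocab]

# Step 2: append them after the existing ids, so all original
# token ids are preserved (backward compatibility)
merged_vocab = dict(base_vocab)
for tok in new_tokens:
    merged_vocab[tok] = len(merged_vocab)

# Step 3: the embedding matrix is then resized to len(merged_vocab)
# rows, keeping the original rows' weights untouched
# (model.resize_token_embeddings in transformers)
print(len(merged_vocab))  # 5: 3 original + 2 new Bangla tokens
```

Because merged tokens are appended rather than interleaved, any text that tokenized under the original SmolLM2 vocabulary maps to the same ids under the merged one.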

## Citation

If you use this model, please cite both the original SmolLM2 and TituLM:

```bibtex
@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
```

## License

This model is released under the Apache 2.0 License, same as the base SmolLM2 model.