---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-135M
tags:
- smollm2
- bangla
- bengali
- multilingual
- causal-lm
- text-generation
---

# Modified SmolLM2 with Bangla Tokenizer Support

This is a modified version of SmolLM2-135M with enhanced Bangla (Bengali) tokenizer support, created by merging tokens from the TituLM tokenizer.

## Model Details

- **Base Model**: HuggingFaceTB/SmolLM2-135M
- **Tokenizer Enhancement**: Merged with the TituLM Bangla tokenizer
- **Original Vocabulary Size**: 49,152
- **Enhanced Vocabulary Size**: 180,177
- **Added Tokens**: 131,025 Bangla-specific tokens

## Key Features

- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with the original SmolLM2 tokenizer
- ✅ More efficient tokenization of Bangla text (fewer tokens per sentence)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the modified model and tokenizer
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")

# Test with Bangla text ("I sing in Bangla")
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Training

This model was created by:

1. Merging the TituLM Bangla tokenizer into the SmolLM2 tokenizer
2. Resizing the model's embedding matrix to accommodate the new vocabulary
3. Preserving the original model weights and architecture otherwise

## Citation

If you use this model, please cite both SmolLM2 and TituLM:

```bibtex
@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
```

## License

This model is released under the Apache 2.0 License, the same license as the base SmolLM2 model.
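## How the Merge Works

The vocabulary-merge step described under Training can be sketched as follows. This is a minimal illustration of the idea using toy dictionaries, not the exact script used to build this model; `merge_vocabularies` is a hypothetical helper name, and the commented-out calls show how the same logic maps onto the `transformers` API (`add_tokens` and `resize_token_embeddings`).

```python
def merge_vocabularies(base_vocab: dict, extra_vocab: dict) -> list:
    """Return the tokens present in extra_vocab but missing from base_vocab,
    sorted for a deterministic ID assignment order."""
    return sorted(set(extra_vocab) - set(base_vocab))

# Toy illustration: two tiny vocabularies standing in for the real
# 49,152-token SmolLM2 vocab and the TituLM Bangla vocab.
base = {"<s>": 0, "hello": 1, "world": 2}
bangla = {"<s>": 0, "আমি": 1, "বাংলা": 2}

added = merge_vocabularies(base, bangla)
print(added)  # the two Bangla tokens absent from the base vocab

# With real tokenizers the same idea becomes (not run here):
#   new_tokens = merge_vocabularies(base_tok.get_vocab(), bangla_tok.get_vocab())
#   base_tok.add_tokens(new_tokens)
#   model.resize_token_embeddings(len(base_tok))
# resize_token_embeddings keeps the trained embedding rows for existing
# tokens and randomly initializes rows for the newly added ones.
```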