---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-135M
tags:
  - smollm2
  - bangla
  - bengali
  - multilingual
  - causal-lm
  - text-generation
---

# Modified SmolLM2 with Bangla Tokenizer Support

This is a modified version of SmolLM2-135M with enhanced Bangla (Bengali) tokenizer support, created by merging tokens from the TituLM tokenizer.

## Model Details

- **Base Model:** HuggingFaceTB/SmolLM2-135M
- **Tokenizer Enhancement:** merged with the TituLM Bangla tokenizer
- **Original Vocabulary Size:** 49,152
- **Enhanced Vocabulary Size:** 180,177
- **Added Tokens:** 131,025 Bangla-specific tokens (180,177 − 49,152)

## Key Features

- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with the original SmolLM2 tokenization
- ✅ Improved tokenization efficiency on Bangla text

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the modified model and tokenizer
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")

# Generate from Bangla text
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
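To see why merged Bangla tokens matter, note that a byte-level BPE vocabulary without Bangla entries falls back to byte pieces, and every Bengali character occupies 3 bytes in UTF-8, so sequences can balloon to several tokens per character. A minimal pure-Python illustration of that bound (no model download required; the counts are rough bounds, not actual tokenizer output):

```python
# Bengali characters (Unicode block U+0980-U+09FF) encode to 3 bytes each
# in UTF-8, so a byte-level fallback can emit up to 3 tokens per character,
# while a vocabulary with whole-word Bangla tokens needs roughly one token
# per word.

text = "আমি বাংলায় গান গাই"  # "I sing in Bangla"

n_chars = len(text)                  # Unicode codepoints
n_bytes = len(text.encode("utf-8"))  # upper bound for byte-level pieces
n_words = len(text.split())          # rough lower bound with word-level tokens

print(f"characters: {n_chars}, UTF-8 bytes: {n_bytes}, words: {n_words}")
```

The gap between the byte count and the word count is the sequence-length overhead the merged vocabulary is meant to eliminate.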

## Training

This model was created by:

1. Merging the TituLM Bangla tokenizer into the SmolLM2 tokenizer
2. Resizing the model's embedding layer to accommodate the new vocabulary
3. Preserving the original model weights and architecture
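The steps above can be sketched as a toy simulation in pure Python (illustrative names and toy vocabularies, not the repository's actual merge script): base token ids are kept so existing text tokenizes identically, only unseen donor tokens get fresh ids, and the embedding matrix grows by exactly that many rows.

```python
import random

# Toy stand-ins for the real vocabularies (SmolLM2: 49,152 entries,
# TituLM donor: the source of the Bangla tokens).
base_vocab = {"hello": 0, "world": 1, "<eos>": 2}
donor_vocab = {"আমি": 0, "গান": 1, "hello": 2}

# 1. Merge: preserve base ids, append only genuinely new tokens.
merged = dict(base_vocab)
for tok in donor_vocab:
    if tok not in merged:
        merged[tok] = len(merged)

# 2. Resize embeddings: one row per token; new rows initialised here at
#    random (mean-of-existing-rows is another common initialisation).
dim = 4
embeddings = [[random.random() for _ in range(dim)] for _ in base_vocab]
embeddings += [[random.random() for _ in range(dim)]
               for _ in range(len(merged) - len(base_vocab))]

# 3. Original weights untouched: the first len(base_vocab) rows are
#    exactly the pre-existing embedding rows.
added = len(merged) - len(base_vocab)
print(f"added {added} tokens; embedding rows: {len(embeddings)}")
```

With the `transformers` library, steps 1 and 2 roughly map onto `tokenizer.add_tokens(new_tokens)` followed by `model.resize_token_embeddings(len(tokenizer))`.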

## Citation

If you use this model, please cite both the original SmolLM2 and TituLM:

```bibtex
@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
```

## License

This model is released under the Apache 2.0 License, same as the base SmolLM2 model.