---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-135M
tags:
- smollm2
- bangla
- bengali
- multilingual
- causal-lm
- text-generation
---
# Modified SmolLM2 with Bangla Tokenizer Support
This is a modified version of SmolLM2-135M with enhanced Bangla (Bengali) tokenizer support, created by merging tokens from the TituLM tokenizer into the SmolLM2 vocabulary.
## Model Details
- **Base Model**: HuggingFaceTB/SmolLM2-135M
- **Tokenizer Enhancement**: Merged with TituLM Bangla tokenizer
- **Original Vocabulary Size**: 49,152
- **Enhanced Vocabulary Size**: 180,177
- **Added Tokens**: 131,025 Bangla-specific tokens
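The counts above are internally consistent; a quick sanity check:

```python
# Vocabulary sizes from the model card
original_vocab = 49_152
enhanced_vocab = 180_177

# Number of tokens added by the TituLM merge
print(enhanced_vocab - original_vocab)  # 131025
```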
## Key Features
- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with original SmolLM2
- ✅ More efficient tokenization of Bangla text (fewer tokens per sentence)
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the modified model
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")
# Test with Bangla text
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
## Training
This model was created by:
1. Merging the TituLM Bangla tokenizer's vocabulary into the SmolLM2 tokenizer
2. Resizing the model's token embeddings to accommodate the new vocabulary
3. Preserving the original model weights and architecture
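In the Transformers library, step 1 corresponds to `tokenizer.add_tokens(...)` and step 2 to `model.resize_token_embeddings(len(tokenizer))`, which grows the embedding matrix from 49,152 to 180,177 rows while leaving existing rows untouched. The vocabulary bookkeeping behind step 1 can be sketched in plain Python with toy vocabularies (not the real tokenizers):

```python
def merge_vocabs(base_vocab, extra_tokens):
    """Append tokens from extra_tokens that base_vocab lacks,
    assigning each the next free id (mirrors what add_tokens does)."""
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1 if merged else 0
    added = 0
    for tok in extra_tokens:
        if tok not in merged:
            merged[tok] = next_id
            next_id += 1
            added += 1
    return merged, added

# Toy example: 5 base tokens, 3 candidate Bangla tokens, 1 overlap
base = {"<s>": 0, "</s>": 1, "the": 2, "an": 3, "গান": 4}
extra = ["আমি", "বাংলায়", "গান"]  # "গান" is already present
merged, added = merge_vocabs(base, extra)
print(len(merged), added)  # 7 2
```

Only tokens absent from the base vocabulary get new ids, which is why the added-token count (131,025) is exactly the difference between the two vocabulary sizes.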
## Citation
If you use this model, please cite both the original SmolLM2 and TituLM:
```bibtex
@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
```
## License
This model is released under the Apache 2.0 License, same as the base SmolLM2 model.