---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-135M
tags:
- smollm2
- bangla
- bengali
- multilingual
- causal-lm
- text-generation
---

# Modified SmolLM2 with Bangla Tokenizer Support

This is a modified version of SmolLM2-135M that adds Bangla (Bengali) tokenizer support by merging tokens from the TituLM tokenizer.

## Model Details

- **Base Model**: HuggingFaceTB/SmolLM2-135M
- **Tokenizer Enhancement**: Merged with TituLM Bangla tokenizer
- **Original Vocabulary Size**: 49,152
- **Enhanced Vocabulary Size**: 180,177
- **Added Tokens**: 131,025 Bangla-specific tokens

## Key Features

- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with original SmolLM2
- ✅ More efficient tokenization of Bangla text (fewer tokens per sentence)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the modified model
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")

# Test with Bangla text
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Training

This model was created by:
1. Merging TituLM Bangla tokenizer with SmolLM2 tokenizer
2. Resizing model embeddings to accommodate new vocabulary
3. Preserving original model weights and architecture
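The three steps above can be sketched with toy vocabularies standing in for the real SmolLM2 and TituLM tokenizers. In `transformers`, steps 1–2 correspond to `tokenizer.add_tokens(...)` and step 3 to `model.resize_token_embeddings(len(tokenizer))`; the vocabularies and token strings below are illustrative only:

```python
# Toy sketch of the tokenizer-merge procedure; tiny dicts stand in
# for the real 49,152-token and Bangla vocabularies.
base_vocab = {"<s>": 0, "hello": 1, "world": 2}      # stands in for SmolLM2's vocab
bangla_vocab = {"আমি": 0, "বাংলা": 1, "hello": 2}    # stands in for TituLM's vocab

# Step 1: collect tokens the base vocabulary is missing
new_tokens = [tok for tok in bangla_vocab if tok not in base_vocab]

# Step 2: append them after the existing ids, so all original
# token ids are preserved (backward compatibility)
merged_vocab = dict(base_vocab)
for tok in new_tokens:
    merged_vocab[tok] = len(merged_vocab)

# Step 3: the embedding matrix is then resized to len(merged_vocab)
# rows, keeping the original rows' weights untouched
# (model.resize_token_embeddings in transformers)
print(len(merged_vocab))  # 5: 3 original + 2 new Bangla tokens
```

Because merged tokens are appended rather than interleaved, any text that tokenized under the original SmolLM2 vocabulary maps to the same ids under the merged one.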

## Citation

If you use this model, please cite both the original SmolLM2 and TituLM:

```bibtex
@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
```

## License

This model is released under the Apache 2.0 License, same as the base SmolLM2 model.