SmolTransformer is a compact implementation of an encoder-decoder Transformer for sequence-to-sequence translation. This project implements an English-to-Hindi translation model trained on the Samanantar dataset.
Encoder: stacks self-attention and feed-forward layers to turn the English source sentence into contextual representations.
Decoder: generates the Hindi translation token by token, attending both to previously generated tokens and to the encoder output.
Attention Mechanisms: multi-head self-attention in the encoder, with masked self-attention and encoder-decoder cross-attention in the decoder (a generic sketch of the attention computation follows).
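For readers new to the architecture, the heart of every attention layer is scaled dot-product attention. The snippet below is a generic PyTorch illustration of that computation, not the exact code in model.py:

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Generic scaled dot-product attention (illustrative, not model.py).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (..., q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # hide future / padded positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ v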
# Clone the repository
git clone https://github.com/yourusername/SmolTransformer.git
cd SmolTransformer

# Install dependencies
chmod +x install.sh
./install.sh
The model configuration can be modified in config.py:
from dataclasses import dataclass

@dataclass
class ModelArgs:
    block_size: int = 512            # Maximum sequence length
    batch_size: int = 32             # Training batch size
    embeddings_dims: int = 512       # Model embedding dimensions
    no_of_heads: int = 8             # Number of attention heads
    no_of_decoder_layers: int = 6    # Number of decoder layers
    max_lr: float = 6e-4             # Maximum learning rate
    # ... additional parameters
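For example, the defaults can be overridden when the config is constructed (a minimal sketch, assuming config.py exposes the ModelArgs dataclass shown above):

from config import ModelArgs

# Smaller run for quick experiments; field names follow the dataclass above.
args = ModelArgs(block_size=256, batch_size=16, max_lr=3e-4)
print(args.embeddings_dims, args.no_of_heads)  # untouched defaults: 512, 8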
Start training with:
python trainer.py
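Under the hood, a teacher-forced sequence-to-sequence training step looks roughly like the sketch below. This is illustrative only, not the repo's trainer.py; the model(src_ids, tgt_ids) forward signature and the pad_token_id argument are assumptions:

import torch.nn.functional as F

def train_step(model, optimizer, src_ids, tgt_ids, pad_token_id):
    # One teacher-forced step: the decoder sees the target shifted right,
    # and the loss is computed against the un-shifted target.
    optimizer.zero_grad()
    logits = model(src_ids, tgt_ids[:, :-1])        # (batch, tgt_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_token_id,                  # do not penalize padding positions
    )
    loss.backward()
    optimizer.step()
    return loss.item()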
Launch the interactive Gradio web interface:
python launch_app.py
The app will be available at http://localhost:7860 and provides an interactive interface for translating English text to Hindi.
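As a rough illustration of what launch_app.py wires together (the translate() helper below is a hypothetical stand-in, not the repo's actual code):

import gradio as gr

def translate(text: str) -> str:
    # Hypothetical helper: the real app would call the trained model,
    # e.g. topk_sampling(model, text, tokenizer, device="cuda", max_length=50).
    return "..."

demo = gr.Interface(
    fn=translate,
    inputs=gr.Textbox(label="English"),
    outputs=gr.Textbox(label="Hindi"),
    title="SmolTransformer English-Hindi Translation",
)
demo.launch(server_port=7860)  # matches http://localhost:7860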
Training uses the total_batch_size value from config.py, and checkpoints are saved to checkpoints/. The model is trained on the Hindi-English Samanantar dataset.
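For reference, the Hindi split of Samanantar is available on the Hugging Face Hub and can be pulled roughly like this (a sketch; the actual loading code lives in data.py, and recent datasets versions may require trust_remote_code=True):

from datasets import load_dataset

# "hi" selects the English-Hindi pairs of the Samanantar corpus.
dataset = load_dataset("ai4bharat/samanantar", "hi", split="train")
print(dataset[0])  # expected fields include the English source and Hindi target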
SmolTransformer/
├── config.py          # Model configuration and hyperparameters
├── model.py           # Transformer model implementation
├── data.py            # Dataset loading and preprocessing
├── tokenizer.py       # Tokenizer setup and utilities
├── trainer.py         # Training loop and utilities
├── inference.py       # Text generation functions
├── install.sh         # Installation script
├── README.md          # This file
├── checkpoints/       # Model checkpoints
├── generated_data/    # Generated text samples
├── gradio/            # Gradio interface (optional)
└── old/               # Backup files
Example inference usage:

from model import Transformer
from tokenizer import initialize_tokenizer
from inference import topk_sampling, beam_search_corrected
# Initialize model and tokenizer
tokenizer = initialize_tokenizer()
model = Transformer(src_vocab_size=len(tokenizer), tgt_vocab_size=len(tokenizer))
# Generate text
prompt = "Hello, how are you?"
generated = topk_sampling(model, prompt, tokenizer, device="cuda", max_length=50)
print(generated)
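For higher-quality outputs, beam_search_corrected (imported above) can be swapped in for topk_sampling. The call below assumes a comparable signature with a beam_width parameter; check inference.py for the real one:

# Hypothetical call; argument names are assumptions.
generated = beam_search_corrected(model, prompt, tokenizer, device="cuda", max_length=50, beam_width=4)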
Modify data.py to load your dataset:
from datasets import load_dataset

def load_datasets(token, sample_size=None):
    # Load your custom dataset here; token is for gated/private Hub datasets.
    dataset = load_dataset("your_dataset", token=token)
    if sample_size is not None:
        dataset["train"] = dataset["train"].select(range(sample_size))
    return dataset
Adjust parameters in config.py:
embeddings_dims = 768 # Larger model
no_of_heads = 12 # More attention heads
no_of_decoder_layers = 12 # Deeper model
This project is open source and available under the MIT License.
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
If you use this code in your research, please cite:
@misc{smoltransformer2024,
  title={SmolTransformer: A Compact Encoder-Decoder Transformer Implementation},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/SmolTransformer}
}