---
language: en
license: apache-2.0
tags:
- sentence-embeddings
- transformers
- bert
- contrastive-learning
datasets:
- snli
---

# Embedder (SNLI)

## Model Description

A lightweight BERT-style encoder trained from scratch on SNLI entailment pairs using an in-batch contrastive loss and mean pooling.

## Training Data

- Dataset: SNLI (loaded via the `datasets` library)
- Filter: label == entailment
- Subsample: 50,000 pairs from the training split
- Tokenizer corpus: premises and hypotheses from the filtered pairs

## Tokenizer

- Type: WordPiece
- Vocab size: 30,000
- Min frequency: 2
- Special tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- Max sequence length: 128

## Architecture

- Model: `BertModel` (trained from scratch)
- Layers: 6
- Hidden size: 384
- Attention heads: 6
- Intermediate size: 1536
- Max position embeddings: 128

## Training Procedure

- Loss: in-batch contrastive loss (temperature = 0.05)
- Pooling: mean pooling over token embeddings
- Normalization: L2-normalized sentence embeddings
- Optimizer: AdamW
- Learning rate: 3e-4
- Batch size: 64
- Epochs: 2
- Device: CUDA if available, else MPS on macOS, else CPU

## Intended Use

- Learning/demo purposes for embedding training and similarity search
- Not intended for production use

## Limitations

- Trained from scratch, so quality is lower than that of pretrained encoders
- Trained only on SNLI entailment pairs
- No downstream evaluation provided

## How to Use

```python
from transformers import BertModel, BertTokenizerFast

model_id = "your-username/embedder-snli"
tokenizer = BertTokenizerFast.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)
```

A complete embedding example (mean pooling plus L2 normalization, matching the training setup) follows the citation below.

## Citation

```bibtex
@inproceedings{bowman2015snli,
  title={A large annotated corpus for learning natural language inference},
  author={Bowman, Samuel R. and Angeli, Gabor and Potts, Christopher and Manning, Christopher D.},
  booktitle={Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  year={2015}
}
```
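## Example: Computing Sentence Embeddings

To get embeddings that match the training setup described above, mean-pool the token states (ignoring padding) and L2-normalize the result. A minimal sketch; `model_id` is the placeholder repo id from the How to Use section, and `embed` is an illustrative helper name, not part of the released model:

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

model_id = "your-username/embedder-snli"  # placeholder repo id
tokenizer = BertTokenizerFast.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)
model.eval()

def embed(sentences):
    # Tokenize with padding/truncation to the model's 128-token limit.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Mean pooling over real tokens only: zero out padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    summed = (out.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    emb = summed / counts
    # L2-normalize so cosine similarity reduces to a dot product.
    return F.normalize(emb, p=2, dim=1)

emb = embed(["A man is playing guitar.", "Someone plays an instrument."])
print(float(emb[0] @ emb[1]))  # cosine similarity of the pair
```

## Example: In-Batch Contrastive Loss

For reference, a minimal InfoNCE-style sketch of the in-batch contrastive objective listed under Training Procedure. It assumes the paired premise/hypothesis embeddings `z_a` and `z_b` (illustrative names) are already L2-normalized; the exact training code may differ:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(z_a, z_b, temperature=0.05):
    # z_a, z_b: (batch, dim) L2-normalized embeddings of paired sentences.
    # Cosine similarity of every premise against every hypothesis in the batch.
    logits = (z_a @ z_b.T) / temperature
    # The matching pair sits on the diagonal; all other pairs act as negatives.
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```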